[2026-03-25 14:21:44,780][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 14:21:47,467][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 14:21:47,473][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 14:21:52,250][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 14:24:37,504][__main__][INFO] - Starting iteration 0. [2026-03-25 14:24:37,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 14:24:37,511][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:24:44,084][__main__][INFO] - Number of regex retries in iteration 0: 0 [2026-03-25 14:24:44,085][__main__][INFO] - agents played in iteration 0 are Alice, Bob [2026-03-25 14:24:44,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.42%, Block Peak % of device VRAM: 18.66%, ΔTime: 00:00:00 [2026-03-25 14:24:44,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.42%, Block Peak % of device VRAM: 18.66%, ΔTime: 00:00:00 [2026-03-25 14:24:44,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:24:44,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 14:24:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:24:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:24:47,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:24:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:24:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:24:49,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:24:50,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:24:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:24:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:24:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:24:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 14:24:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:24:54,551][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:24:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:24:55,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:24:56,524][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:24:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:24:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:24:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:24:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:24:59,812][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 14:25:00,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:25:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:25:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:25:02,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:25:03,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:25:03,755][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:25:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:25:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:25:05,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:25:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:25:07,044][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 14:25:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:25:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:25:09,016][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:25:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:25:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:25:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:25:11,646][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:25:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:25:12,959][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
14:25:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:25:14,548][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:25:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:25:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:25:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:25:17,179][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:25:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:25:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:25:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:25:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:25:20,468][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 14:25:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:25:21,782][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:25:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:25:23,097][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:25:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:25:24,411][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:25:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:25:25,726][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:25:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:25:27,042][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 14:25:27,701][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:25:28,358][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:25:29,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 14:25:29,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.51%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.14%, ΔTime: 00:00:44 [2026-03-25 14:25:31,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:25:31,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:25:31,074][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:25:32,575][__main__][INFO] - Iteration 1 took 55s (11.94% Gen, 85.33% Train). Generation: 6s, Training: 46s. Estimated remaining time: 15h 13m 37s. Estimated total time: 15h 17m 46s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 53s. [2026-03-25 14:25:32,578][__main__][INFO] - Starting iteration 1. [2026-03-25 14:25:32,581][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:25:32,582][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:25:37,597][__main__][INFO] - Number of regex retries in iteration 1: 0 [2026-03-25 14:25:37,599][__main__][INFO] - agents played in iteration 1 are Alice, Bob [2026-03-25 14:25:38,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:25:38,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:25:38,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:25:38,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:25:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:25:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:25:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:25:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:25:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:25:42,166][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:25:42,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:25:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:25:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:25:44,795][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:25:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:25:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:25:46,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:25:47,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:25:48,084][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:25:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:25:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:25:50,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:25:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:25:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:25:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:25:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:25:53,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:25:54,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:25:54,679][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:25:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:25:55,996][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:25:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:25:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:25:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:25:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:25:59,284][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:25:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:26:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:26:01,257][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:26:01,915][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:26:02,572][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:26:03,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:26:03,887][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:26:04,544][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:26:05,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:26:05,862][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:26:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:26:07,178][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:26:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:26:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:26:09,150][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:26:09,807][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:26:10,768][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:26:11,426][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:26:12,084][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:26:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:26:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:26:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:26:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:26:15,372][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:26:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:26:16,689][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:26:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:26:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:26:18,662][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:26:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:26:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:26:20,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:26:21,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:26:22,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:26:23,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:26:23,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:26:23,224][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:26:24,437][__main__][INFO] - Iteration 2 took 51s (9.68% Gen, 87.98% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 19m 16s. Estimated total time: 14h 24m 16s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 8s. [2026-03-25 14:26:24,439][__main__][INFO] - Starting iteration 2. [2026-03-25 14:26:24,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:26:24,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:26:29,448][__main__][INFO] - Number of regex retries in iteration 2: 0 [2026-03-25 14:26:29,449][__main__][INFO] - agents played in iteration 2 are Alice, Bob [2026-03-25 14:26:30,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:26:30,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:26:30,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:26:30,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:26:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:26:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:26:32,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:26:32,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:26:33,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:26:34,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:26:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:26:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:26:36,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:26:36,716][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:26:37,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:26:38,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:26:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:26:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:26:40,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:26:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:26:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:26:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:26:42,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:26:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:26:43,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:26:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:26:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:26:45,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:26:46,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:26:47,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:26:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:26:48,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:26:49,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:26:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:26:50,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:26:51,185][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:26:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:26:52,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:26:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:26:53,817][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:26:54,474][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:26:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:26:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:26:56,450][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:26:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:26:57,767][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:26:58,425][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:26:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:26:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:27:00,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:27:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:27:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:27:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:27:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:27:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:27:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:27:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:27:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:27:06,619][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:27:07,277][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:27:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:27:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:27:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:27:09,909][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:27:10,567][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:27:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:27:11,885][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:27:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:27:13,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:27:13,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:27:14,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:27:14,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:27:14,962][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:27:16,324][__main__][INFO] - Iteration 3 took 51s (9.65% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 18m 50s. Estimated total time: 14h 24m 43s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 21s. [2026-03-25 14:27:16,327][__main__][INFO] - Starting iteration 3. [2026-03-25 14:27:16,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:27:16,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:27:21,442][__main__][INFO] - Number of regex retries in iteration 3: 0 [2026-03-25 14:27:21,443][__main__][INFO] - agents played in iteration 3 are Alice, Bob [2026-03-25 14:27:22,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:22,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:22,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:27:22,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:27:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:27:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:27:24,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:27:24,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:27:25,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:27:26,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:27:26,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:27:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:27:27,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:27:28,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:27:29,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:27:29,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:27:30,603][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:27:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:27:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:27:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:27:33,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:27:33,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:27:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:27:35,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:27:35,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:27:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:27:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:27:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:27:38,497][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:27:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:27:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:27:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:27:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:27:41,792][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:27:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:27:43,107][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:27:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:27:44,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:27:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:27:45,738][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:27:46,396][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:27:47,055][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:27:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:27:48,375][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:27:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:27:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:27:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:27:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:27:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:27:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:27:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:27:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:27:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:27:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:27:55,925][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:27:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:27:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:27:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:27:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:27:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:27:59,875][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:28:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:28:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:28:01,857][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:28:02,517][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:28:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:28:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:28:04,500][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:28:05,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:28:05,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:28:07,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:28:07,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:28:07,140][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:28:08,427][__main__][INFO] - Iteration 4 took 52s (9.81% Gen, 87.71% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 21m 33s. Estimated total time: 14h 28m 17s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 8s. [2026-03-25 14:28:08,429][__main__][INFO] - Starting iteration 4. [2026-03-25 14:28:08,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:28:08,433][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:28:13,797][__main__][INFO] - Number of regex retries in iteration 4: 0 [2026-03-25 14:28:13,799][__main__][INFO] - agents played in iteration 4 are Alice, Bob [2026-03-25 14:28:14,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:28:14,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:28:14,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:28:14,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:28:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:28:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:28:16,379][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:28:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:28:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:28:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:28:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:28:19,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:28:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:28:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:28:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:28:22,303][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:28:22,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:28:23,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:28:24,277][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:28:24,934][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:28:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:28:26,250][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:28:26,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:28:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:28:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:28:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:28:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:28:30,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:28:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:28:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:28:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:28:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:28:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:28:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:28:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:28:35,468][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:28:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:28:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:28:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:28:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:28:38,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:28:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:28:40,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:28:40,731][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:28:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:28:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:28:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:28:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:28:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:28:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:28:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:28:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:28:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:28:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:28:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:28:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:28:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:28:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:28:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:28:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:28:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:28:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:28:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:28:54,204][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:28:54,863][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:28:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:28:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:28:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:28:57,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:28:58,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:28:59,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:28:59,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:28:59,403][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:29:00,707][__main__][INFO] - Iteration 5 took 52s (10.26% Gen, 87.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 23m 39s. Estimated total time: 14h 31m 15s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 37s. [2026-03-25 14:29:00,710][__main__][INFO] - Starting iteration 5. [2026-03-25 14:29:00,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:29:00,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:29:05,881][__main__][INFO] - Number of regex retries in iteration 5: 0 [2026-03-25 14:29:05,882][__main__][INFO] - agents played in iteration 5 are Alice, Bob [2026-03-25 14:29:06,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:06,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:06,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:29:06,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:29:07,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:29:07,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:29:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:29:09,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:29:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:29:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:29:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:29:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:29:12,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:29:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:29:13,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:29:14,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:29:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:29:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:29:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:29:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:29:17,701][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:29:18,363][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:29:19,022][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:29:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:29:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:29:21,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:29:21,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:29:22,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:29:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:29:23,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:29:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:29:24,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:29:25,616][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:29:26,276][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:29:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:29:27,595][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:29:28,256][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:29:28,916][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:29:29,576][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:29:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:29:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:29:31,557][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:29:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:29:32,879][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:29:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:29:34,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:29:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:29:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:29:36,183][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:29:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:29:37,505][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:29:38,167][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:29:39,149][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:29:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:29:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:29:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:29:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:29:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:29:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:29:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:29:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:29:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:29:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:29:46,405][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:29:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:29:47,725][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:29:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:29:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:29:49,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:29:50,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:29:51,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:29:51,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:29:51,586][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:29:52,975][__main__][INFO] - Iteration 6 took 52s (9.89% Gen, 87.45% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 22m 33s. Estimated total time: 14h 31m 2s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 31s. [2026-03-25 14:29:52,978][__main__][INFO] - Starting iteration 6. [2026-03-25 14:29:52,985][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:29:52,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:29:58,220][__main__][INFO] - Number of regex retries in iteration 6: 0 [2026-03-25 14:29:58,222][__main__][INFO] - agents played in iteration 6 are Alice, Bob [2026-03-25 14:29:58,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:58,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:58,852][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:29:58,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:29:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:30:00,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:30:00,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:30:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:30:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:30:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:30:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:30:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:30:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:30:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:30:06,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:30:06,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:30:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:30:08,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:30:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:30:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:30:10,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:30:10,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:30:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:30:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:30:12,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:30:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:30:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:30:14,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:30:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:30:15,993][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:30:16,654][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:30:17,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:30:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:30:18,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:30:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:30:19,955][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:30:20,615][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:30:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:30:21,935][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:30:22,595][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:30:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:30:23,914][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:30:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:30:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:30:25,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:30:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:30:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:30:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:30:28,527][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:30:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:30:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:30:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:30:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:30:32,132][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:30:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:30:33,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:30:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:30:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:30:35,427][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:30:36,087][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:30:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:30:37,408][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:30:38,068][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:30:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:30:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:30:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:30:40,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:30:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:30:42,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:30:42,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:30:44,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:30:44,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:30:44,010][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:30:45,297][__main__][INFO] - Iteration 7 took 52s (10.01% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 22m 34s. Estimated total time: 14h 31m 55s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 57s. [2026-03-25 14:30:45,299][__main__][INFO] - Starting iteration 7. [2026-03-25 14:30:45,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:30:45,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:30:50,350][__main__][INFO] - Number of regex retries in iteration 7: 0 [2026-03-25 14:30:50,351][__main__][INFO] - agents played in iteration 7 are Alice, Bob [2026-03-25 14:30:50,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:30:51,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:30:51,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:30:51,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:30:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:30:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:30:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:30:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:30:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:30:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:30:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:30:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:30:56,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:30:57,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:30:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:30:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:30:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:31:00,220][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:31:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:31:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:31:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:31:02,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:31:03,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:31:04,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:31:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:31:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:31:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:31:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:31:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:31:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:31:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:31:09,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:31:10,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:31:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:31:11,430][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:31:12,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:31:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:31:13,407][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:31:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:31:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:31:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:31:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:31:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:31:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:31:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:31:18,683][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:31:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:31:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:31:20,661][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:31:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:31:21,979][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:31:22,638][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:31:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:31:24,264][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:31:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:31:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:31:26,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:31:26,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:31:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:31:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:31:28,882][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:31:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:31:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:31:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:31:31,520][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:31:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:31:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:31:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:31:34,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:31:34,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:31:35,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:31:35,978][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:31:35,979][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:31:37,444][__main__][INFO] - Iteration 8 took 52s (9.68% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 18m 49s. Estimated total time: 14h 29m 2s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 31s. [2026-03-25 14:31:37,450][__main__][INFO] - Starting iteration 8. [2026-03-25 14:31:37,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:31:37,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:31:42,514][__main__][INFO] - Number of regex retries in iteration 8: 0 [2026-03-25 14:31:42,515][__main__][INFO] - agents played in iteration 8 are Alice, Bob [2026-03-25 14:31:43,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:31:43,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:31:43,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:31:43,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:31:43,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:31:44,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:31:45,066][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:31:45,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:31:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:31:47,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:31:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:31:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:31:49,016][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:31:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:31:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:31:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:31:51,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:31:52,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:31:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:31:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:31:54,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:31:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:31:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:31:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:31:56,925][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:31:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:31:58,242][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:31:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:31:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:32:00,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:32:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:32:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:32:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:32:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:32:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:32:04,174][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:32:04,833][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:32:05,492][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:32:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:32:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:32:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:32:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:32:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:32:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:32:10,103][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:32:10,762][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:32:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:32:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:32:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:32:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:32:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:32:14,717][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:32:15,685][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:32:16,344][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:32:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:32:17,662][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:32:18,322][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:32:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:32:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:32:20,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:32:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:32:21,620][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:32:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:32:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:32:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:32:24,258][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:32:24,918][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:32:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:32:26,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:32:27,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:32:28,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:32:28,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:32:28,062][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:32:29,368][__main__][INFO] - Iteration 9 took 51s (9.74% Gen, 87.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 14m 10s. Estimated total time: 14h 25m 15s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 37s. [2026-03-25 14:32:29,371][__main__][INFO] - Starting iteration 9. [2026-03-25 14:32:29,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:32:29,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:32:35,080][__main__][INFO] - Number of regex retries in iteration 9: 0 [2026-03-25 14:32:35,081][__main__][INFO] - agents played in iteration 9 are Alice, Bob [2026-03-25 14:32:35,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:32:35,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:32:35,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:32:35,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:32:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:32:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:32:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:32:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:32:38,987][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:32:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:32:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:32:40,963][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:32:41,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:32:42,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:32:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:32:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:32:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:32:44,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:32:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:32:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:32:46,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:32:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:32:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:32:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:32:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:32:50,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:32:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:32:51,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:32:52,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:32:52,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:32:53,477][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:32:54,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:32:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:32:55,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:32:56,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:32:56,772][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:32:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:32:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:32:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:32:59,406][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:33:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:33:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:33:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:33:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:33:02,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:33:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:33:04,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:33:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:33:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:33:05,993][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:33:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:33:07,311][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:33:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:33:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:33:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:33:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:33:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:33:11,529][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:33:12,188][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:33:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:33:13,505][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:33:14,164][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:33:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:33:15,482][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:33:16,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:33:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:33:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:33:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:33:18,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:33:19,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:33:20,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:33:20,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:33:20,559][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:33:21,966][__main__][INFO] - Iteration 10 took 52s (10.85% Gen, 86.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 24m 34s. Estimated total time: 14h 36m 32s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 16s. [2026-03-25 14:33:21,969][__main__][INFO] - Starting iteration 10. [2026-03-25 14:33:21,973][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:33:21,973][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:33:28,297][__main__][INFO] - Number of regex retries in iteration 10: 0 [2026-03-25 14:33:28,299][__main__][INFO] - agents played in iteration 10 are Alice, Bob [2026-03-25 14:33:28,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:33:28,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:33:28,966][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:33:28,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:33:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:33:30,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:33:30,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:33:31,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:33:32,194][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:33:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:33:33,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:33:34,167][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:33:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:33:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:33:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:33:36,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:33:37,454][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:33:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:33:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:33:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:33:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:33:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:33:41,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:33:42,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:33:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:33:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:33:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:33:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:33:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:33:46,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:33:46,668][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:33:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:33:47,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:33:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:33:49,301][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:33:49,959][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:33:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:33:51,277][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:33:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:33:52,594][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:33:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:33:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:33:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:33:55,226][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:33:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:33:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:33:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:33:57,859][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:33:58,521][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:33:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:33:59,837][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:34:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:34:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:34:02,091][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:34:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:34:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:34:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:34:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:34:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:34:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:34:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:34:07,356][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:34:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:34:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:34:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:34:10,647][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:34:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:34:11,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:34:12,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:34:13,779][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:34:13,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:34:13,783][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:34:15,085][__main__][INFO] - Iteration 11 took 53s (11.91% Gen, 85.63% Train). Generation: 6s, Training: 45s. Estimated remaining time: 14h 32m 23s. Estimated total time: 14h 45m 14s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 37s. [2026-03-25 14:34:15,088][__main__][INFO] - Starting iteration 11. [2026-03-25 14:34:15,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:34:15,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:34:20,089][__main__][INFO] - Number of regex retries in iteration 11: 0 [2026-03-25 14:34:20,091][__main__][INFO] - agents played in iteration 11 are Alice, Bob [2026-03-25 14:34:20,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:34:20,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:34:20,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:34:20,645][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:34:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:34:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:34:22,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:34:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:34:23,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:34:24,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:34:25,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:34:25,847][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:34:26,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:34:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:34:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:34:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:34:29,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:34:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:34:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:34:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:34:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:34:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:34:33,087][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:34:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:34:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:34:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:34:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:34:36,377][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:34:37,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:34:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:34:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:34:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:34:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:34:40,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:34:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:34:41,644][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:34:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:34:42,961][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:34:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:34:44,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:34:44,935][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:34:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:34:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:34:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:34:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:34:48,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:34:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:34:49,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:34:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:34:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:34:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:34:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:34:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:34:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:34:54,529][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:34:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:34:55,845][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:34:56,504][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:34:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:34:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:34:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:34:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:59,797][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:35:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:35:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:35:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:35:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:35:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:35:03,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:35:04,527][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:35:06,025][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:06,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:06,032][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:07,351][__main__][INFO] - Iteration 12 took 52s (9.57% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 14h 17m 17s. Estimated total time: 14h 31m 1s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 30s. [2026-03-25 14:35:07,354][__main__][INFO] - Starting iteration 12. [2026-03-25 14:35:07,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:07,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:35:12,551][__main__][INFO] - Number of regex retries in iteration 12: 0 [2026-03-25 14:35:12,553][__main__][INFO] - agents played in iteration 12 are Alice, Bob [2026-03-25 14:35:13,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:35:13,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:35:13,194][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:35:13,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:35:13,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:35:14,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:35:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:35:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:35:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:35:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:35:17,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:35:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:35:19,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:35:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:35:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:35:21,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:35:21,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:35:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:35:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:35:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:35:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:35:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:35:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:35:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:35:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:35:27,663][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:35:28,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:35:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:35:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:35:30,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:35:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:35:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:35:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:35:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:35:33,595][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:35:34,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:35:34,912][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:35:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:35:36,229][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:35:36,887][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:35:37,546][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:35:38,204][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:35:38,862][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:35:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:35:40,180][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:35:40,842][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:35:41,502][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:35:42,167][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:35:42,826][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:35:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:35:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:35:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:35:45,812][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:35:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:35:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:35:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:35:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:35:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:35:49,773][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:35:50,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:35:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:35:51,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:35:52,415][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:35:53,075][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:35:53,736][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:35:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:35:55,058][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:35:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:35:56,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:35:57,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:35:58,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:58,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:58,493][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:59,737][__main__][INFO] - Iteration 13 took 52s (9.91% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 18m 24s. Estimated total time: 14h 33m 0s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 30s. [2026-03-25 14:35:59,739][__main__][INFO] - Starting iteration 13. [2026-03-25 14:35:59,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:59,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:36:04,915][__main__][INFO] - Number of regex retries in iteration 13: 0 [2026-03-25 14:36:04,917][__main__][INFO] - agents played in iteration 13 are Alice, Bob [2026-03-25 14:36:05,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:05,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:05,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:36:05,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:36:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:36:06,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:36:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:36:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:36:08,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:36:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:36:10,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:36:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:36:11,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:36:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:36:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:36:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:36:14,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:36:14,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:36:15,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:36:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:36:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:36:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:36:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:36:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:36:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:36:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:36:20,710][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:36:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:36:22,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:36:22,684][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:36:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:36:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:36:24,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:36:25,316][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:36:25,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:36:26,633][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:36:27,291][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:36:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:36:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:36:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:36:29,921][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:36:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:36:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:36:31,896][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:36:32,555][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:36:33,214][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:36:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:36:34,529][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:36:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:36:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:36:36,503][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:36:37,161][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:36:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:36:38,791][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:36:39,450][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:36:40,108][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:36:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:36:41,428][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:36:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:36:42,748][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:36:43,406][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:36:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:36:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:36:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:36:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:36:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:36:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:36:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:36:48,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:36:49,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:36:50,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:36:50,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:36:50,727][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:36:52,012][__main__][INFO] - Iteration 14 took 52s (9.89% Gen, 87.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 15m 42s. Estimated total time: 14h 31m 10s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2026-03-25 14:36:52,015][__main__][INFO] - Starting iteration 14. [2026-03-25 14:36:52,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:36:52,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:36:57,344][__main__][INFO] - Number of regex retries in iteration 14: 0 [2026-03-25 14:36:57,345][__main__][INFO] - agents played in iteration 14 are Alice, Bob [2026-03-25 14:36:57,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:57,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:57,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:36:57,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:36:58,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:36:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:36:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:37:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:37:01,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:37:01,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:37:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:37:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:37:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:37:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:37:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:37:05,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:37:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:37:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:37:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:37:08,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:37:09,078][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:37:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:37:10,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:37:11,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:37:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:37:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:37:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:37:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:37:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:37:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:37:15,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:37:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:37:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:37:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:37:18,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:37:18,973][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:37:19,633][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:37:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:37:20,954][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:37:21,613][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:37:22,273][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:37:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:37:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:37:24,256][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:37:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:37:25,574][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:37:26,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:37:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:37:27,554][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:37:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:37:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:37:29,533][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:37:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:37:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:37:31,838][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:37:32,497][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:37:33,156][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:37:33,814][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:37:34,473][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:37:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:37:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:37:36,450][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:37:37,111][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:37:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:37:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:37:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:37:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:37:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:37:41,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:37:41,853][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:37:43,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:37:43,148][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:37:43,149][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:37:44,561][__main__][INFO] - Iteration 15 took 52s (10.13% Gen, 87.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 19m 23s. Estimated total time: 14h 35m 44s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 52s. [2026-03-25 14:37:44,564][__main__][INFO] - Starting iteration 15. [2026-03-25 14:37:44,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:37:44,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:37:50,559][__main__][INFO] - Number of regex retries in iteration 15: 0 [2026-03-25 14:37:50,560][__main__][INFO] - agents played in iteration 15 are Alice, Bob [2026-03-25 14:37:51,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:37:51,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:37:51,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:37:51,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:37:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:37:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:37:53,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:37:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:37:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:37:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:37:56,031][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:37:56,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:37:57,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:37:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:37:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:37:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:37:59,985][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:38:00,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:38:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:38:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:38:02,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:38:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:38:03,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:38:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:38:05,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:38:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:38:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:38:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:38:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:38:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:38:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:38:09,872][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:38:10,531][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:38:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:38:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:38:12,509][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:38:13,168][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:38:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:38:14,486][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:38:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:38:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:38:16,462][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:38:17,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:38:17,779][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:38:18,438][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:38:19,096][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:38:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:38:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:38:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:38:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:38:22,391][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:38:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:38:24,027][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:38:24,685][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:38:25,344][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:38:26,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:38:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:38:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:38:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:38:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:38:29,301][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:38:29,962][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:38:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:38:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:38:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:38:32,599][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:38:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:38:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:38:34,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:38:35,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:38:36,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:38:36,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:38:36,562][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:38:37,871][__main__][INFO] - Iteration 16 took 53s (11.24% Gen, 86.30% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 31m 10s. Estimated total time: 14h 48m 24s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 50s, 500 more iterations: 7h 24m 12s. [2026-03-25 14:38:37,873][__main__][INFO] - Starting iteration 16. [2026-03-25 14:38:37,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:38:37,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:38:43,167][__main__][INFO] - Number of regex retries in iteration 16: 0 [2026-03-25 14:38:43,169][__main__][INFO] - agents played in iteration 16 are Alice, Bob [2026-03-25 14:38:43,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:38:43,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:38:43,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:38:43,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:38:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:38:45,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:38:45,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:38:46,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:38:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:38:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:38:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:38:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:38:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:38:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:38:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:38:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:38:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:38:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:38:53,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:38:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:38:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:38:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:38:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:38:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:38:57,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:38:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:38:58,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:38:59,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:39:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:39:00,953][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:39:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:39:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:39:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:39:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:39:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:39:04,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:39:05,576][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:39:06,236][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:39:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:39:07,556][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:39:08,216][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:39:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:39:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:39:10,198][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:39:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:39:11,520][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:39:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:39:12,840][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:39:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:39:14,162][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:39:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:39:15,481][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:39:16,467][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:39:17,128][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:39:17,788][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:39:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:39:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:39:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:39:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:39:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:39:21,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:39:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:39:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:39:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:39:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:39:25,053][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:39:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:39:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:39:27,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:39:27,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:39:29,041][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:39:29,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:39:29,045][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:39:30,346][__main__][INFO] - Iteration 17 took 52s (10.08% Gen, 87.43% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 16m 24s. Estimated total time: 14h 34m 30s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 15s. [2026-03-25 14:39:30,349][__main__][INFO] - Starting iteration 17. [2026-03-25 14:39:30,353][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:39:30,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:39:35,682][__main__][INFO] - Number of regex retries in iteration 17: 0 [2026-03-25 14:39:35,684][__main__][INFO] - agents played in iteration 17 are Alice, Bob [2026-03-25 14:39:36,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:39:36,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:39:36,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:39:36,316][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:39:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:39:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:39:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:39:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:39:39,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:39:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:39:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:39:41,546][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:39:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:39:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:39:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:39:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:39:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:39:45,497][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:39:46,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:39:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:39:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:39:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:39:48,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:39:49,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:39:50,106][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:39:50,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:39:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:39:52,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:39:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:39:53,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:39:54,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:39:54,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:39:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:39:56,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:39:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:39:57,351][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:39:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:39:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:39:59,328][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:39:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:40:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:40:01,304][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:40:01,963][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:40:02,622][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:40:03,280][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:40:03,939][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:40:04,598][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:40:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:40:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:40:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:40:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:40:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:40:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:40:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:40:10,163][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:40:10,821][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:40:11,480][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:40:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:40:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:40:13,455][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:40:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:40:14,773][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:40:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:40:16,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:40:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:40:17,407][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:40:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:40:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:40:19,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:40:20,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:40:21,265][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:40:21,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:40:21,270][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:40:22,587][__main__][INFO] - Iteration 18 took 52s (10.21% Gen, 87.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 11m 36s. Estimated total time: 14h 30m 35s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 17s. [2026-03-25 14:40:22,591][__main__][INFO] - Starting iteration 18. [2026-03-25 14:40:22,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:40:22,595][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:40:27,724][__main__][INFO] - Number of regex retries in iteration 18: 0 [2026-03-25 14:40:27,726][__main__][INFO] - agents played in iteration 18 are Alice, Bob [2026-03-25 14:40:28,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:40:28,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:40:28,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:40:28,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:40:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:40:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:40:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:40:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:40:31,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:40:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:40:32,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:40:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:40:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:40:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:40:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:40:36,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:40:36,808][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:40:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:40:38,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:40:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:40:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:40:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:40:40,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:40:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:40:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:40:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:40:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:40:44,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:40:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:40:45,363][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:40:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:40:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:40:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:40:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:40:48,654][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:40:49,312][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:40:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:40:50,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:40:51,288][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:40:51,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:40:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:40:53,262][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:40:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:40:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:40:55,237][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:40:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:40:56,554][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:40:57,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:40:57,872][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:40:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:40:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:40:59,848][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:41:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:41:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:41:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:41:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:41:03,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:41:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:41:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:41:05,426][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:41:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:41:06,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:41:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:41:08,065][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:41:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:41:09,384][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:41:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:41:10,702][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:41:11,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:41:12,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:41:13,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:41:13,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:41:13,399][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:41:16,129][__main__][INFO] - Iteration 19 took 53s (9.58% Gen, 85.31% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 32m 24s. Estimated total time: 14h 52m 16s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 13s, 500 more iterations: 7h 26m 8s. [2026-03-25 14:41:16,132][__main__][INFO] - Starting iteration 19. [2026-03-25 14:41:16,136][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:41:16,137][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:41:21,421][__main__][INFO] - Number of regex retries in iteration 19: 0 [2026-03-25 14:41:21,423][__main__][INFO] - agents played in iteration 19 are Alice, Bob [2026-03-25 14:41:22,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:41:22,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:41:22,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:41:22,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:41:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:41:23,491][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:41:24,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:41:24,808][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:41:25,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:41:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:41:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:41:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:41:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:41:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:41:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:41:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:41:30,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:41:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:41:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:41:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:41:33,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:41:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:41:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:41:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:41:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:41:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:41:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:41:37,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:41:38,632][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:41:39,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:41:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:41:40,608][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:41:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:41:41,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:41:42,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:41:43,243][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:41:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:41:44,560][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:41:45,219][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:41:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:41:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:41:47,194][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:41:47,853][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:41:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:41:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:41:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:41:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:41:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:41:51,812][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:41:52,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:41:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:41:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:41:54,820][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:41:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:41:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:41:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:41:57,457][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:41:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:41:58,775][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:41:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:42:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:42:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:42:01,408][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:42:02,067][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:42:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:42:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:42:04,041][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:42:04,700][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:42:05,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:42:06,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:42:07,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:42:07,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:42:07,428][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:42:08,705][__main__][INFO] - Iteration 20 took 52s (10.06% Gen, 87.51% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 15m 26s. Estimated total time: 14h 36m 11s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 5s. [2026-03-25 14:42:08,708][__main__][INFO] - Starting iteration 20. [2026-03-25 14:42:08,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:42:08,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:42:13,847][__main__][INFO] - Number of regex retries in iteration 20: 0 [2026-03-25 14:42:13,849][__main__][INFO] - agents played in iteration 20 are Alice, Bob [2026-03-25 14:42:14,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:42:14,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:42:14,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:42:14,416][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:42:15,073][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:42:15,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:42:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:42:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:42:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:42:18,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:42:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:42:19,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:42:20,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:42:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:42:21,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:42:22,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:42:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:42:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:42:24,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:42:24,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:42:25,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:42:26,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:42:26,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:42:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:42:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:42:28,843][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:42:29,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:42:30,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:42:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:42:31,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:42:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:42:32,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:42:33,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:42:34,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:42:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:42:35,423][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:42:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:42:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:42:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:42:38,057][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:42:38,718][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:42:39,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:42:40,035][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:42:40,694][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:42:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:42:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:42:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:42:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:42:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:42:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:42:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:42:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:42:46,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:42:47,595][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:42:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:42:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:42:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:42:50,230][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:42:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:42:51,546][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:42:52,205][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:42:52,864][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:42:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:42:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:42:54,838][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:42:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:42:56,154][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:42:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:42:57,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:42:58,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:42:59,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:42:59,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:42:59,553][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:43:00,849][__main__][INFO] - Iteration 21 took 52s (9.85% Gen, 87.66% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 7m 21s. Estimated total time: 14h 28m 57s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 28s. [2026-03-25 14:43:00,851][__main__][INFO] - Starting iteration 21. [2026-03-25 14:43:00,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:43:00,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:43:05,961][__main__][INFO] - Number of regex retries in iteration 21: 0 [2026-03-25 14:43:05,962][__main__][INFO] - agents played in iteration 21 are Alice, Bob [2026-03-25 14:43:06,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:06,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:06,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:43:06,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:43:07,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:43:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:43:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:43:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:43:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:43:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:43:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:43:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:43:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:43:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:43:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:43:14,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:43:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:43:15,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:43:16,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:43:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:43:17,816][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:43:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:43:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:43:19,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:43:20,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:43:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:43:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:43:22,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:43:23,095][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:43:23,755][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:43:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:43:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:43:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:43:26,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:43:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:43:27,718][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:43:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:43:29,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:43:29,697][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:43:30,357][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:43:31,018][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:43:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:43:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:43:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:43:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:43:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:43:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:43:35,638][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:43:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:43:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:43:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:43:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:43:39,264][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:43:39,924][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:43:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:43:41,244][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:43:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:43:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:43:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:43:43,882][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:43:44,542][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:43:45,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:43:45,862][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:43:46,521][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:43:47,181][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:43:47,841][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:43:48,501][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:43:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:43:49,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:43:50,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:43:51,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:43:51,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:43:51,663][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:43:52,961][__main__][INFO] - Iteration 22 took 52s (9.80% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 5m 58s. Estimated total time: 14h 28m 27s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 13s. [2026-03-25 14:43:52,963][__main__][INFO] - Starting iteration 22. [2026-03-25 14:43:52,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:43:52,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:43:58,036][__main__][INFO] - Number of regex retries in iteration 22: 0 [2026-03-25 14:43:58,038][__main__][INFO] - agents played in iteration 22 are Alice, Bob [2026-03-25 14:43:58,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:58,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:58,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:43:58,713][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:43:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:43:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:44:00,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:44:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:44:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:44:02,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:44:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:44:03,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:44:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:44:05,253][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:44:05,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:44:06,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:44:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:44:07,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:44:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:44:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:44:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:44:10,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:44:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:44:11,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:44:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:44:13,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:44:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:44:14,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:44:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:44:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:44:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:44:17,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:44:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:44:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:44:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:44:19,753][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:44:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:44:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:44:21,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:44:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:44:23,051][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:44:23,710][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:44:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:44:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:44:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:44:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:44:27,006][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:44:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:44:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:44:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:44:29,645][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:44:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:44:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:44:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:44:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:44:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:44:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:44:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:44:35,219][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:44:35,878][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:44:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:44:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:44:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:44:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:44:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:44:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:44:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:44:41,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:44:41,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:44:42,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:44:43,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:44:43,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:44:43,579][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:44:44,907][__main__][INFO] - Iteration 23 took 51s (9.76% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 2m 21s. Estimated total time: 14h 25m 42s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 51s. [2026-03-25 14:44:44,910][__main__][INFO] - Starting iteration 23. [2026-03-25 14:44:44,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:44:44,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:44:49,971][__main__][INFO] - Number of regex retries in iteration 23: 0 [2026-03-25 14:44:49,972][__main__][INFO] - agents played in iteration 23 are Alice, Bob [2026-03-25 14:44:50,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:50,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:50,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:44:50,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:44:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:44:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:44:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:44:53,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:44:54,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:44:54,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:44:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:44:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:44:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:44:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:44:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:44:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:44:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:45:00,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:45:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:45:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:45:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:45:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:45:03,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:45:04,045][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:45:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:45:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:45:06,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:45:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:45:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:45:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:45:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:45:09,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:45:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:45:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:45:11,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:45:11,953][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:45:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:45:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:45:13,931][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:45:14,590][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:45:15,250][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:45:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:45:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:45:17,226][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:45:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:45:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:45:19,203][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:45:19,862][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:45:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:45:21,180][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:45:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:45:22,498][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:45:23,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:45:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:45:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:45:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:45:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:45:26,767][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:45:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:45:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:45:28,746][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:45:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:45:30,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:45:30,725][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:45:31,384][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:45:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:45:32,703][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:45:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:45:34,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:45:34,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:45:35,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:45:35,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:45:35,861][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:45:37,163][__main__][INFO] - Iteration 24 took 52s (9.68% Gen, 87.82% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 6m 37s. Estimated total time: 14h 30m 50s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 25s. [2026-03-25 14:45:37,166][__main__][INFO] - Starting iteration 24. [2026-03-25 14:45:37,170][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:45:37,170][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:45:42,188][__main__][INFO] - Number of regex retries in iteration 24: 0 [2026-03-25 14:45:42,190][__main__][INFO] - agents played in iteration 24 are Alice, Bob [2026-03-25 14:45:42,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:45:42,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:45:42,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:45:42,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:45:43,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:45:44,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:45:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:45:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:45:46,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:45:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:45:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:45:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:45:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:45:49,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:45:50,104][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:45:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:45:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:45:52,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:45:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:45:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:45:54,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:45:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:45:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:45:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:45:56,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:45:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:45:58,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:45:58,663][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:45:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:45:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:46:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:46:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:46:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:46:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:46:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:46:03,931][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:46:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:46:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:46:05,909][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:46:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:46:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:46:07,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:46:08,545][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:46:09,204][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:46:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:46:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:46:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:46:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:46:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:46:13,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:46:13,819][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:46:14,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:46:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:46:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:46:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:46:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:46:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:46:18,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:46:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:46:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:46:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:46:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:46:22,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:46:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:46:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:46:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:46:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:46:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:46:25,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:46:26,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:46:27,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:46:27,906][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:46:27,908][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:46:29,066][__main__][INFO] - Iteration 25 took 51s (9.67% Gen, 88.09% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 59m 53s. Estimated total time: 14h 24m 58s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 29s. [2026-03-25 14:46:29,069][__main__][INFO] - Starting iteration 25. [2026-03-25 14:46:29,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:46:29,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:46:35,113][__main__][INFO] - Number of regex retries in iteration 25: 0 [2026-03-25 14:46:35,115][__main__][INFO] - agents played in iteration 25 are Alice, Bob [2026-03-25 14:46:35,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:46:35,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:46:35,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:46:35,771][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:46:36,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:46:37,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:46:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:46:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:46:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:46:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:46:40,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:46:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:46:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:46:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:46:42,956][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:46:43,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:46:44,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:46:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:46:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:46:46,247][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:46:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:46:47,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:46:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:46:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:46:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:46:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:46:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:46:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:46:52,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:46:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:46:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:46:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:46:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:46:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:46:56,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:46:56,786][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:46:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:46:58,104][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:46:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:46:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:47:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:47:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:47:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:47:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:47:02,717][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:47:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:47:04,034][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:47:04,692][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:47:05,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:47:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:47:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:47:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:47:08,389][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:47:09,050][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:47:09,712][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:47:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:47:11,037][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:47:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:47:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:47:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:47:13,681][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:47:14,341][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:47:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:47:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:47:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:47:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:47:17,645][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:47:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:47:18,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:47:19,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:47:20,900][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:47:20,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:47:20,905][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:47:22,233][__main__][INFO] - Iteration 26 took 53s (11.36% Gen, 86.13% Train). Generation: 6s, Training: 45s. Estimated remaining time: 14h 20m 3s. Estimated total time: 14h 46m 1s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 36s, 500 more iterations: 7h 23m 0s. [2026-03-25 14:47:22,237][__main__][INFO] - Starting iteration 26. [2026-03-25 14:47:22,243][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:47:22,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:47:27,433][__main__][INFO] - Number of regex retries in iteration 26: 0 [2026-03-25 14:47:27,434][__main__][INFO] - agents played in iteration 26 are Alice, Bob [2026-03-25 14:47:28,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:47:28,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:47:28,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:47:28,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:47:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:47:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:47:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:47:30,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:47:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:47:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:47:32,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:47:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:47:34,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:47:34,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:47:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:47:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:47:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:47:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:47:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:47:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:47:39,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:47:40,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:47:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:47:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:47:41,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:47:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:47:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:47:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:47:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:47:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:47:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:47:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:47:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:47:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:47:48,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:47:49,229][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:47:49,888][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:47:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:47:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:47:51,864][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:47:52,524][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:47:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:47:53,843][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:47:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:47:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:47:55,822][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:47:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:47:57,146][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:47:57,805][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:47:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:47:59,125][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:47:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:48:00,766][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:48:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:48:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:48:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:48:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:48:04,061][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:48:04,720][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:48:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:48:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:48:06,697][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:48:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:48:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:48:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:48:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:48:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:48:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:48:11,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:48:12,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:48:13,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:48:13,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:48:13,189][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:48:14,560][__main__][INFO] - Iteration 27 took 52s (9.92% Gen, 87.45% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 5m 9s. Estimated total time: 14h 32m 0s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 0s. [2026-03-25 14:48:14,563][__main__][INFO] - Starting iteration 27. [2026-03-25 14:48:14,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:48:14,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:48:24,498][__main__][INFO] - Number of regex retries in iteration 27: 0 [2026-03-25 14:48:24,499][__main__][INFO] - agents played in iteration 27 are Alice, Bob [2026-03-25 14:48:25,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:48:25,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:48:25,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:48:25,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:48:25,886][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:48:26,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:48:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:48:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:48:28,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:48:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:48:29,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:48:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:48:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:48:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:48:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:48:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:48:33,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:48:34,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:48:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:48:35,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:48:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:48:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:48:37,696][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:48:38,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:48:39,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:48:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:48:40,336][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:48:40,996][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:48:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:48:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:48:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:48:43,637][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:48:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:48:44,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:48:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:48:46,270][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:48:46,929][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:48:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:48:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:48:48,904][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:48:49,562][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:48:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:48:50,880][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:48:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:48:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:48:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:48:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:48:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:48:54,832][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:48:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:48:56,149][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:48:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:48:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:48:58,463][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:48:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:48:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:49:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:49:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:49:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:49:02,412][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:49:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:49:03,730][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:49:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:49:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:49:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:49:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:49:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:49:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:49:08,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:49:09,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:49:10,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:49:10,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:49:10,705][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:49:11,965][__main__][INFO] - Iteration 28 took 57s (17.30% Gen, 80.50% Train). Generation: 9s, Training: 46s. Estimated remaining time: 15h 28m 51s. Estimated total time: 15h 56m 39s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 39s, 500 more iterations: 7h 58m 19s. [2026-03-25 14:49:11,968][__main__][INFO] - Starting iteration 28. [2026-03-25 14:49:11,972][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:49:11,973][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:49:17,031][__main__][INFO] - Number of regex retries in iteration 28: 0 [2026-03-25 14:49:17,032][__main__][INFO] - agents played in iteration 28 are Alice, Bob [2026-03-25 14:49:17,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:49:17,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:49:17,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:49:17,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:49:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:49:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:49:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:49:20,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:49:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:49:21,620][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:49:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:49:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:49:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:49:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:49:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:49:25,572][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:49:26,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:49:26,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:49:27,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:49:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:49:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:49:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:49:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:49:30,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:49:31,499][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:49:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:49:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:49:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:49:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:49:34,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:49:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:49:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:49:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:49:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:49:38,085][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:49:38,744][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:49:39,403][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:49:40,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:49:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:49:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:49:42,042][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:49:42,700][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:49:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:49:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:49:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:49:45,335][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:49:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:49:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:49:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:49:47,968][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:49:48,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:49:49,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:49:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:49:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:49:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:49:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:49:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:49:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:49:54,217][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:49:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:49:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:49:56,193][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:49:56,851][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:49:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:49:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:49:58,828][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:49:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:50:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:50:00,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:50:01,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:50:02,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:50:02,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:50:02,705][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:50:03,908][__main__][INFO] - Iteration 29 took 51s (9.74% Gen, 87.94% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 56m 58s. Estimated total time: 14h 25m 38s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 49s. [2026-03-25 14:50:03,911][__main__][INFO] - Starting iteration 29. [2026-03-25 14:50:03,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:50:03,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:50:09,134][__main__][INFO] - Number of regex retries in iteration 29: 0 [2026-03-25 14:50:09,135][__main__][INFO] - agents played in iteration 29 are Alice, Bob [2026-03-25 14:50:09,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:50:09,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:50:09,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:50:09,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:50:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:50:11,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:50:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:50:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:50:13,076][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:50:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:50:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:50:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:50:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:50:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:50:17,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:50:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:50:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:50:19,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:50:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:50:20,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:50:20,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:50:21,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:50:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:50:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:50:23,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:50:24,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:50:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:50:25,582][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:50:26,241][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:50:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:50:27,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:50:28,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:50:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:50:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:50:30,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:50:30,850][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:50:31,508][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:50:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:50:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:50:33,482][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:50:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:50:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:50:35,460][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:50:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:50:36,777][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:50:37,435][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:50:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:50:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:50:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:50:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:50:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:50:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:50:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:50:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:50:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:50:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:50:44,991][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:50:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:50:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:50:46,972][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:50:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:50:48,292][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:50:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:50:49,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:50:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:50:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:50:51,591][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:50:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:50:52,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:50:53,692][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:50:54,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:50:54,779][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:50:54,780][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:50:56,204][__main__][INFO] - Iteration 30 took 52s (9.98% Gen, 87.29% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 1m 58s. Estimated total time: 14h 31m 30s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 45s. [2026-03-25 14:50:56,206][__main__][INFO] - Starting iteration 30. [2026-03-25 14:50:56,211][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:50:56,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:51:01,633][__main__][INFO] - Number of regex retries in iteration 30: 0 [2026-03-25 14:51:01,634][__main__][INFO] - agents played in iteration 30 are Alice, Bob [2026-03-25 14:51:02,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:02,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:02,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:51:02,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:51:02,952][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:51:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:51:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:51:04,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:51:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:51:06,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:51:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:51:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:51:08,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:51:08,834][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:51:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:51:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:51:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:51:11,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:51:12,125][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:51:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:51:13,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:51:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:51:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:51:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:51:16,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:51:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:51:17,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:51:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:51:18,711][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:51:19,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:51:20,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:51:20,687][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:51:21,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:51:22,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:51:22,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:51:23,320][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:51:23,978][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:51:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:51:25,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:51:25,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:51:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:51:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:51:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:51:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:51:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:51:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:51:30,562][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:51:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:51:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:51:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:51:33,194][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:51:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:51:34,828][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:51:35,487][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:51:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:51:36,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:51:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:51:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:51:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:51:39,438][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:51:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:51:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:51:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:51:42,073][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:51:42,732][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:51:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:51:44,049][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:51:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:51:45,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:51:46,215][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:51:47,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:51:47,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:51:47,492][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:51:48,695][__main__][INFO] - Iteration 31 took 52s (10.33% Gen, 87.37% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 4m 21s. Estimated total time: 14h 34m 46s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 23s. [2026-03-25 14:51:48,698][__main__][INFO] - Starting iteration 31. [2026-03-25 14:51:48,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:51:48,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:51:53,769][__main__][INFO] - Number of regex retries in iteration 31: 0 [2026-03-25 14:51:53,771][__main__][INFO] - agents played in iteration 31 are Alice, Bob [2026-03-25 14:51:54,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:54,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:54,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:51:54,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:51:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:51:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:51:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:51:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:51:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:51:58,370][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:51:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:51:59,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:52:00,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:52:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:52:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:52:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:52:02,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:52:03,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:52:04,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:52:04,950][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:52:05,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:52:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:52:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:52:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:52:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:52:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:52:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:52:10,219][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:52:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:52:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:52:12,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:52:12,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:52:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:52:14,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:52:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:52:15,485][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:52:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:52:16,804][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:52:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:52:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:52:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:52:19,440][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:52:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:52:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:52:21,416][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:52:22,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:52:22,734][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:52:23,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:52:24,050][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:52:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:52:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:52:26,024][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:52:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:52:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:52:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:52:28,952][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:52:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:52:30,268][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:52:30,927][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:52:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:52:32,245][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:52:32,903][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:52:33,562][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:52:34,221][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:52:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:52:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:52:36,196][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:52:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:52:37,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:52:38,299][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:52:39,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:52:39,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:52:39,403][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:52:40,798][__main__][INFO] - Iteration 32 took 52s (9.73% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 57m 2s. Estimated total time: 14h 28m 18s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 9s. [2026-03-25 14:52:40,801][__main__][INFO] - Starting iteration 32. [2026-03-25 14:52:40,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:52:40,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:52:45,621][__main__][INFO] - Number of regex retries in iteration 32: 0 [2026-03-25 14:52:45,623][__main__][INFO] - agents played in iteration 32 are Alice, Bob [2026-03-25 14:52:46,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:52:46,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:52:46,280][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:52:46,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:52:46,936][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:52:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:52:48,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:52:48,875][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:52:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:52:50,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:52:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:52:51,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:52:52,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:52:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:52:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:52:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:52:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:52:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:52:56,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:52:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:52:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:52:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:52:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:52:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:53:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:53:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:53:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:53:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:53:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:53:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:53:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:53:04,677][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:53:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:53:05,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:53:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:53:07,311][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:53:07,969][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:53:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:53:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:53:09,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:53:10,603][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:53:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:53:11,921][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:53:12,579][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:53:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:53:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:53:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:53:15,216][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:53:15,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:53:16,532][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:53:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:53:17,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:53:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:53:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:53:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:53:20,778][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:53:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:53:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:53:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:53:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:53:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:53:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:53:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:53:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:53:26,706][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:53:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:53:28,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:53:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:53:29,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:53:30,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:53:31,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:53:31,538][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:53:31,540][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:53:32,863][__main__][INFO] - Iteration 33 took 52s (9.25% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 55m 31s. Estimated total time: 14h 27m 40s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 50s. [2026-03-25 14:53:32,866][__main__][INFO] - Starting iteration 33. [2026-03-25 14:53:32,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:53:32,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:53:37,683][__main__][INFO] - Number of regex retries in iteration 33: 0 [2026-03-25 14:53:37,684][__main__][INFO] - agents played in iteration 33 are Alice, Bob [2026-03-25 14:53:38,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:38,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:38,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:53:38,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:53:38,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:53:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:53:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:53:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:53:41,587][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:53:42,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:53:42,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:53:43,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:53:44,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:53:44,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:53:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:53:46,191][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:53:46,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:53:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:53:48,164][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:53:48,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:53:49,481][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:53:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:53:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:53:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:53:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:53:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:53:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:53:54,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:53:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:53:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:53:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:53:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:53:57,376][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:53:58,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:53:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:53:59,351][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:54:00,010][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:54:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:54:01,326][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:54:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:54:02,642][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:54:03,300][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:54:03,959][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:54:04,617][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:54:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:54:05,934][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:54:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:54:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:54:07,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:54:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:54:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:54:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:54:10,783][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:54:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:54:12,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:54:12,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:54:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:54:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:54:14,735][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:54:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:54:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:54:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:54:17,370][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:54:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:54:18,689][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:54:19,347][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:54:20,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:54:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:54:21,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:54:22,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:54:23,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:54:23,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:54:23,192][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:54:24,610][__main__][INFO] - Iteration 34 took 51s (9.30% Gen, 87.95% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 49m 21s. Estimated total time: 14h 22m 22s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 11s. [2026-03-25 14:54:24,613][__main__][INFO] - Starting iteration 34. [2026-03-25 14:54:24,617][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:54:24,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:54:29,715][__main__][INFO] - Number of regex retries in iteration 34: 0 [2026-03-25 14:54:29,716][__main__][INFO] - agents played in iteration 34 are Alice, Bob [2026-03-25 14:54:30,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:54:30,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:54:30,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:54:30,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:54:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:54:31,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:54:32,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:54:32,946][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:54:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:54:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:54:34,918][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:54:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:54:36,233][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:54:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:54:37,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:54:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:54:38,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:54:39,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:54:40,182][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:54:40,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:54:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:54:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:54:42,815][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:54:43,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:54:44,132][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:54:44,791][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:54:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:54:46,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:54:46,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:54:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:54:48,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:54:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:54:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:54:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:54:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:54:51,377][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:54:52,036][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:54:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:54:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:54:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:54:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:54:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:54:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:54:56,646][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:54:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:54:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:54:58,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:54:59,282][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:54:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:55:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:55:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:55:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:55:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:55:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:55:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:55:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:55:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:55:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:55:06,792][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:55:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:55:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:55:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:55:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:55:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:55:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:55:11,401][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:55:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:55:12,717][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:55:13,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:55:14,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:55:15,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:55:15,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:55:15,237][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:55:16,514][__main__][INFO] - Iteration 35 took 51s (9.82% Gen, 87.71% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 51m 6s. Estimated total time: 14h 24m 59s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 29s. [2026-03-25 14:55:16,516][__main__][INFO] - Starting iteration 35. [2026-03-25 14:55:16,520][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:55:16,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:55:21,377][__main__][INFO] - Number of regex retries in iteration 35: 0 [2026-03-25 14:55:21,378][__main__][INFO] - agents played in iteration 35 are Alice, Bob [2026-03-25 14:55:21,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:55:22,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:55:22,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:55:22,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:55:22,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:55:23,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:55:23,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:55:24,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:55:25,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:55:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:55:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:55:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:55:27,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:55:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:55:29,230][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:55:29,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:55:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:55:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:55:31,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:55:32,519][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:55:33,177][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:55:33,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:55:34,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:55:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:55:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:55:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:55:37,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:55:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:55:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:55:39,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:55:39,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:55:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:55:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:55:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:55:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:55:43,050][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:55:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:55:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:55:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:55:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:55:46,344][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:55:47,003][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:55:47,662][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:55:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:55:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:55:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:55:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:55:50,958][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:55:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:55:52,274][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:55:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:55:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:55:54,563][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:55:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:55:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:55:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:55:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:55:57,856][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:55:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:55:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:55:59,833][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:56:00,492][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:56:01,150][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:56:01,809][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:56:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:56:03,127][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:56:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:56:04,444][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:56:05,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:56:05,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:56:06,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:56:06,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:56:06,941][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:56:08,602][__main__][INFO] - Iteration 36 took 52s (9.33% Gen, 87.48% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 53m 18s. Estimated total time: 14h 28m 3s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 1s. [2026-03-25 14:56:08,605][__main__][INFO] - Starting iteration 36. [2026-03-25 14:56:08,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:56:08,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:56:13,440][__main__][INFO] - Number of regex retries in iteration 36: 0 [2026-03-25 14:56:13,441][__main__][INFO] - agents played in iteration 36 are Alice, Bob [2026-03-25 14:56:14,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:56:14,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:56:14,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:56:14,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:56:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:56:15,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:56:16,044][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:56:16,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:56:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:56:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:56:18,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:56:19,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:56:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:56:20,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:56:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:56:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:56:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:56:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:56:23,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:56:24,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:56:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:56:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:56:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:56:27,247][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:56:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:56:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:56:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:56:29,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:56:30,543][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:56:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:56:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:56:32,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:56:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:56:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:56:34,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:56:35,158][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:56:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:56:36,476][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:56:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:56:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:56:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:56:39,114][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:56:39,774][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:56:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:56:41,094][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:56:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:56:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:56:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:56:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:56:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:56:45,049][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:56:45,709][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:56:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:56:47,259][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:56:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:56:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:56:49,236][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:56:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:56:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:56:51,216][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:56:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:56:52,535][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:56:53,194][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:56:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:56:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:56:55,171][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:56:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:56:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:56:57,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:56:57,997][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:56:59,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:56:59,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:56:59,197][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:57:00,620][__main__][INFO] - Iteration 37 took 52s (9.29% Gen, 87.97% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 51m 16s. Estimated total time: 14h 26m 53s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 26s. [2026-03-25 14:57:00,623][__main__][INFO] - Starting iteration 37. [2026-03-25 14:57:00,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:57:00,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:57:05,534][__main__][INFO] - Number of regex retries in iteration 37: 0 [2026-03-25 14:57:05,535][__main__][INFO] - agents played in iteration 37 are Alice, Bob [2026-03-25 14:57:06,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:57:06,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:57:06,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:57:06,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:57:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:57:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:57:08,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:57:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:57:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:57:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:57:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:57:11,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:57:12,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:57:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:57:13,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:57:14,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:57:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:57:15,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:57:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:57:16,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:57:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:57:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:57:18,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:57:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:57:20,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:57:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:57:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:57:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:57:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:57:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:57:24,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:57:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:57:25,331][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:57:25,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:57:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:57:27,306][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:57:27,964][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:57:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:57:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:57:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:57:30,599][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:57:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:57:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:57:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:57:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:57:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:57:34,552][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:57:35,211][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:57:35,870][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:57:36,529][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:57:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:57:37,847][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:57:38,767][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:57:39,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:57:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:57:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:57:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:57:42,060][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:57:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:57:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:57:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:57:44,696][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:57:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:57:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:57:46,672][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:57:47,331][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:57:47,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:57:48,649][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:57:49,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:57:49,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:57:51,178][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:57:51,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:57:51,183][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:57:52,548][__main__][INFO] - Iteration 38 took 51s (9.45% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 48m 54s. Estimated total time: 14h 25m 22s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 41s. [2026-03-25 14:57:52,550][__main__][INFO] - Starting iteration 38. [2026-03-25 14:57:52,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:57:52,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:58:00,721][__main__][INFO] - Number of regex retries in iteration 38: 0 [2026-03-25 14:58:00,722][__main__][INFO] - agents played in iteration 38 are Alice, Bob [2026-03-25 14:58:01,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:01,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:01,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:58:01,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:58:02,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:58:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:58:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:58:04,033][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:58:04,691][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:58:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:58:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:58:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:58:07,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:58:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:58:08,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:58:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:58:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:58:10,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:58:11,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:58:11,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:58:12,591][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:58:13,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:58:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:58:14,566][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:58:15,224][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:58:15,882][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:58:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:58:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:58:17,856][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:58:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:58:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:58:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:58:20,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:58:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:58:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:58:22,467][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:58:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:58:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:58:24,441][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:58:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:58:25,757][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:58:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:58:27,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:58:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:58:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:58:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:58:29,710][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:58:30,371][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:58:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:58:31,690][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:58:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:58:33,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:58:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:58:34,613][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:58:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:58:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:58:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:58:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:58:37,904][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:58:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:58:39,221][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:58:39,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:58:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:58:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:58:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:58:42,514][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:58:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:58:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:58:44,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:58:45,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:58:46,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:58:46,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:58:46,320][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:58:47,748][__main__][INFO] - Iteration 39 took 55s (14.80% Gen, 82.61% Train). Generation: 8s, Training: 45s. Estimated remaining time: 14h 42m 31s. Estimated total time: 15h 19m 55s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 57s. [2026-03-25 14:58:47,750][__main__][INFO] - Starting iteration 39. [2026-03-25 14:58:47,755][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:58:47,755][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:58:52,767][__main__][INFO] - Number of regex retries in iteration 39: 0 [2026-03-25 14:58:52,769][__main__][INFO] - agents played in iteration 39 are Alice, Bob [2026-03-25 14:58:53,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:53,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:53,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:58:53,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:58:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:58:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:58:55,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:58:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:58:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:58:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:58:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:58:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:58:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:58:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:59:00,654][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:59:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:59:01,970][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:59:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:59:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:59:03,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:59:04,602][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:59:05,260][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:59:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:59:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:59:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:59:07,893][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:59:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:59:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:59:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:59:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:59:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:59:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:59:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:59:13,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:59:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:59:14,474][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:59:15,132][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:59:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:59:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:59:17,108][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:59:17,768][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:59:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:59:19,085][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:59:19,744][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:59:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:59:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:59:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:59:22,377][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:59:23,035][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:59:23,694][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:59:24,352][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:59:25,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:59:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:59:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:59:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:59:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:59:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:59:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:59:29,870][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:59:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:59:31,187][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:59:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:59:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:59:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:59:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:59:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:59:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:59:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:59:36,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:59:37,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:59:38,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:59:38,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:59:38,446][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:59:39,693][__main__][INFO] - Iteration 40 took 51s (9.65% Gen, 87.94% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 47m 25s. Estimated total time: 14h 25m 40s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 50s. [2026-03-25 14:59:39,697][__main__][INFO] - Starting iteration 40. [2026-03-25 14:59:39,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:59:39,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:59:44,668][__main__][INFO] - Number of regex retries in iteration 40: 0 [2026-03-25 14:59:44,670][__main__][INFO] - agents played in iteration 40 are Alice, Bob [2026-03-25 14:59:45,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:59:45,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:59:45,341][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:59:45,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:59:45,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:59:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:59:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:59:47,911][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:59:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:59:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:59:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:59:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:59:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:59:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:59:52,520][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:59:53,178][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:59:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:59:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:59:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:59:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:59:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:59:57,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:59:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:59:58,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:59:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:59:59,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:00:00,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:00:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:00:01,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:00:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:00:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:00:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:00:04,365][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:00:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:00:05,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:00:06,338][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:00:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:00:07,654][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:00:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:00:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:00:09,628][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:00:10,286][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:00:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:00:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:00:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:00:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:00:13,575][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:00:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:00:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:00:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:00:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:00:16,867][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:00:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:00:18,433][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:00:19,091][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:00:19,749][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:00:20,407][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:00:21,064][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:00:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:00:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:00:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:00:23,698][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:00:24,355][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:00:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:00:25,671][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:00:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:00:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:00:27,651][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:00:28,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:00:29,016][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:00:30,204][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:00:30,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:00:30,209][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:00:31,548][__main__][INFO] - Iteration 41 took 51s (9.58% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 45m 1s. Estimated total time: 14h 24m 9s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2026-03-25 15:00:31,550][__main__][INFO] - Starting iteration 41. [2026-03-25 15:00:31,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:00:31,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:00:36,455][__main__][INFO] - Number of regex retries in iteration 41: 0 [2026-03-25 15:00:36,456][__main__][INFO] - agents played in iteration 41 are Alice, Bob [2026-03-25 15:00:37,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:00:37,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:00:37,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:00:37,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:00:37,802][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:00:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:00:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:00:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:00:40,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:00:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:00:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:00:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:00:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:00:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:00:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:00:45,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:00:45,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:00:46,316][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:00:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:00:47,631][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:00:48,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:00:48,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:00:49,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:00:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:00:50,920][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:00:51,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:00:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:00:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:00:53,552][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:00:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:00:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:00:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:00:56,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:00:56,842][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:00:57,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:00:58,159][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:00:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:00:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:01:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:01:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:01:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:01:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:01:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:01:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:01:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:01:04,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:01:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:01:06,056][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:01:06,714][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:01:07,374][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:01:08,031][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:01:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:01:09,596][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:01:10,255][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:01:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:01:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:01:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:01:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:01:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:01:14,202][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:01:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:01:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:01:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:01:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:01:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:01:18,152][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:01:18,810][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:01:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:01:20,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:01:20,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:01:21,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:01:21,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:01:21,889][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:01:23,197][__main__][INFO] - Iteration 42 took 51s (9.49% Gen, 87.97% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 40m 45s. Estimated total time: 14h 20m 45s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 4s, 500 more iterations: 7h 10m 22s. [2026-03-25 15:01:23,200][__main__][INFO] - Starting iteration 42. [2026-03-25 15:01:23,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:01:23,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:01:28,461][__main__][INFO] - Number of regex retries in iteration 42: 0 [2026-03-25 15:01:28,462][__main__][INFO] - agents played in iteration 42 are Alice, Bob [2026-03-25 15:01:29,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:01:29,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:01:29,133][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:01:29,133][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:01:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:01:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:01:31,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:01:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:01:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:01:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:01:33,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:01:34,349][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:01:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:01:35,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:01:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:01:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:01:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:01:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:01:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:01:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:01:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:01:40,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:01:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:01:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:01:42,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:01:43,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:01:44,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:01:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:01:45,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:01:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:01:46,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:01:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:01:48,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:01:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:01:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:01:50,155][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:01:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:01:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:01:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:01:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:01:53,453][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:01:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:01:54,772][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:01:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:01:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:01:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:01:57,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:01:58,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:01:58,729][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:01:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:02:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:02:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:02:01,656][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:02:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:02:02,976][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:02:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:02:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:02:04,955][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:02:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:02:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:02:06,934][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:02:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:02:08,253][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:02:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:02:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:02:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:02:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:02:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:02:12,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:02:12,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:02:14,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:02:14,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:02:14,136][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:02:15,912][__main__][INFO] - Iteration 43 took 52s (9.98% Gen, 86.65% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 57m 38s. Estimated total time: 14h 38m 30s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 15s. [2026-03-25 15:02:15,915][__main__][INFO] - Starting iteration 43. [2026-03-25 15:02:15,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:02:15,919][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:02:21,246][__main__][INFO] - Number of regex retries in iteration 43: 0 [2026-03-25 15:02:21,247][__main__][INFO] - agents played in iteration 43 are Alice, Bob [2026-03-25 15:02:21,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:21,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:21,917][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:02:21,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:02:22,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:02:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:02:23,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:02:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:02:25,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:02:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:02:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:02:27,148][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:02:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:02:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:02:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:02:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:02:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:02:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:02:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:02:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:02:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:02:33,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:02:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:02:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:02:35,716][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:02:36,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:02:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:02:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:02:38,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:02:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:02:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:02:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:02:40,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:02:41,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:02:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:02:42,967][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:02:43,626][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:02:44,285][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:02:44,944][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:02:45,602][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:02:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:02:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:02:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:02:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:02:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:02:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:02:50,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:02:50,875][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:02:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:02:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:02:52,852][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:02:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:02:54,412][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:02:55,071][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:02:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:02:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:02:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:02:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:02:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:02:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:02:59,686][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:03:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:03:01,005][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:03:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:03:02,325][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:03:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:03:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:03:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:03:04,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:03:05,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:03:06,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:03:06,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:03:06,819][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:03:10,664][__main__][INFO] - Iteration 44 took 54s (9.73% Gen, 83.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 30m 40s. Estimated total time: 15h 12m 26s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 13s. [2026-03-25 15:03:10,667][__main__][INFO] - Starting iteration 44. [2026-03-25 15:03:10,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:03:10,672][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:03:15,618][__main__][INFO] - Number of regex retries in iteration 44: 0 [2026-03-25 15:03:15,619][__main__][INFO] - agents played in iteration 44 are Alice, Bob [2026-03-25 15:03:16,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:03:16,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:03:16,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:03:16,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:03:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:03:17,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:03:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:03:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:03:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:03:20,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:03:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:03:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:03:22,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:03:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:03:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:03:24,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:03:24,796][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:03:25,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:03:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:03:26,771][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:03:27,430][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:03:28,088][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:03:28,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:03:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:03:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:03:30,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:03:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:03:32,041][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:03:32,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:03:33,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:03:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:03:34,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:03:35,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:03:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:03:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:03:37,309][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:03:37,968][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:03:38,626][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:03:39,285][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:03:39,943][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:03:40,601][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:03:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:03:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:03:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:03:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:03:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:03:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:03:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:03:45,872][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:03:46,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:03:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:03:47,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:03:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:03:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:03:50,097][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:03:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:03:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:03:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:03:52,733][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:03:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:03:54,049][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:03:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:03:55,366][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:03:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:03:56,685][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:03:57,344][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:03:58,003][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:03:58,662][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:03:59,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:04:00,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:04:01,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:04:01,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:04:01,251][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:04:02,485][__main__][INFO] - Iteration 45 took 51s (9.55% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 40m 57s. Estimated total time: 14h 23m 35s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 47s. [2026-03-25 15:04:02,488][__main__][INFO] - Starting iteration 45. [2026-03-25 15:04:02,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:04:02,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:04:08,346][__main__][INFO] - Number of regex retries in iteration 45: 0 [2026-03-25 15:04:08,347][__main__][INFO] - agents played in iteration 45 are Alice, Bob [2026-03-25 15:04:08,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:04:09,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:04:09,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:04:09,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:04:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:04:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:04:10,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:04:11,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:04:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:04:12,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:04:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:04:14,243][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:04:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:04:15,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:04:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:04:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:04:17,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:04:18,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:04:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:04:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:04:20,168][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:04:20,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:04:21,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:04:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:04:22,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:04:23,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:04:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:04:24,777][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:04:25,439][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:04:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:04:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:04:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:04:28,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:04:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:04:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:04:30,065][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:04:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:04:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:04:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:04:32,707][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:04:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:04:34,028][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:04:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:04:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:04:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:04:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:04:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:04:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:04:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:04:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:04:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:04:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:04:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:04:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:04:42,923][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:04:43,582][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:04:44,240][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:04:44,900][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:04:45,559][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:04:46,218][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:04:46,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:04:47,537][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:04:48,196][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:04:48,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:04:49,512][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:04:50,170][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:04:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:04:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:04:52,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:04:52,982][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:04:54,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:04:54,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:04:54,108][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:04:55,951][__main__][INFO] - Iteration 46 took 53s (10.95% Gen, 85.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 7m 28s. Estimated total time: 14h 51m 0s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 6s, 500 more iterations: 7h 25m 30s. [2026-03-25 15:04:55,953][__main__][INFO] - Starting iteration 46. [2026-03-25 15:04:55,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:04:55,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:05:00,954][__main__][INFO] - Number of regex retries in iteration 46: 0 [2026-03-25 15:05:00,955][__main__][INFO] - agents played in iteration 46 are Alice, Bob [2026-03-25 15:05:01,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:01,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:01,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:05:01,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:05:02,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:05:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:05:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:05:04,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:05:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:05:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:05:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:05:06,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:05:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:05:08,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:05:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:05:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:05:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:05:10,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:05:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:05:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:05:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:05:13,451][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:05:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:05:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:05:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:05:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:05:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:05:17,403][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:05:18,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:05:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:05:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:05:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:05:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:05:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:05:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:05:22,673][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:05:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:05:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:05:24,647][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:05:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:05:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:05:26,621][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:05:27,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:05:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:05:28,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:05:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:05:29,914][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:05:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:05:31,232][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:05:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:05:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:05:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:05:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:05:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:05:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:05:36,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:05:36,827][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:05:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:05:38,146][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:05:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:05:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:05:40,124][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:05:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:05:41,441][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:05:42,100][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:05:42,758][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:05:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:05:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:05:44,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:05:45,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:05:46,652][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:05:46,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:05:46,657][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:05:48,128][__main__][INFO] - Iteration 47 took 52s (9.57% Gen, 87.60% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 45m 8s. Estimated total time: 14h 29m 32s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 46s. [2026-03-25 15:05:48,131][__main__][INFO] - Starting iteration 47. [2026-03-25 15:05:48,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:05:48,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:05:52,956][__main__][INFO] - Number of regex retries in iteration 47: 0 [2026-03-25 15:05:52,958][__main__][INFO] - agents played in iteration 47 are Alice, Bob [2026-03-25 15:05:53,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:53,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:53,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:05:53,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:05:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:05:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:05:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:05:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:05:56,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:05:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:05:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:05:58,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:05:59,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:06:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:06:00,816][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:06:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:06:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:06:02,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:06:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:06:04,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:06:04,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:06:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:06:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:06:06,739][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:06:07,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:06:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:06:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:06:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:06:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:06:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:06:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:06:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:06:12,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:06:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:06:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:06:14,641][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:06:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:06:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:06:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:06:17,275][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:06:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:06:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:06:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:06:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:06:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:06:21,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:06:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:06:22,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:06:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:06:23,859][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:06:24,517][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:06:25,176][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:06:26,150][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:06:26,809][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:06:27,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:06:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:06:28,783][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:06:29,442][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:06:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:06:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:06:31,418][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:06:32,076][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:06:32,736][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:06:33,395][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:06:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:06:34,712][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:06:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:06:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:06:36,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:06:37,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:06:38,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:06:38,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:06:38,513][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:06:39,896][__main__][INFO] - Iteration 48 took 51s (9.32% Gen, 88.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 37m 27s. Estimated total time: 14h 22m 43s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 21s. [2026-03-25 15:06:39,899][__main__][INFO] - Starting iteration 48. [2026-03-25 15:06:39,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:06:39,903][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:06:45,067][__main__][INFO] - Number of regex retries in iteration 48: 0 [2026-03-25 15:06:45,068][__main__][INFO] - agents played in iteration 48 are Alice, Bob [2026-03-25 15:06:45,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:06:45,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:06:45,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:06:45,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:06:46,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:06:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:06:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:06:48,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:06:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:06:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:06:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:06:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:06:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:06:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:06:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:06:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:06:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:06:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:06:55,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:06:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:06:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:06:57,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:06:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:06:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:06:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:07:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:07:00,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:07:01,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:07:02,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:07:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:07:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:07:04,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:07:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:07:05,466][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:07:06,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:07:06,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:07:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:07:08,100][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:07:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:07:09,416][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:07:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:07:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:07:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:07:12,050][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:07:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:07:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:07:14,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:07:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:07:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:07:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:07:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:07:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:07:18,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:07:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:07:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:07:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:07:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:07:21,594][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:07:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:07:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:07:23,569][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:07:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:07:24,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:07:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:07:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:07:26,862][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:07:27,521][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:07:28,179][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:07:28,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:07:29,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:07:30,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:07:30,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:07:30,647][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:07:31,995][__main__][INFO] - Iteration 49 took 52s (9.91% Gen, 87.49% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 42m 5s. Estimated total time: 14h 28m 13s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 6s. [2026-03-25 15:07:31,998][__main__][INFO] - Starting iteration 49. [2026-03-25 15:07:32,001][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:07:32,002][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:07:36,811][__main__][INFO] - Number of regex retries in iteration 49: 0 [2026-03-25 15:07:36,813][__main__][INFO] - agents played in iteration 49 are Alice, Bob [2026-03-25 15:07:37,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:07:37,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:07:37,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:07:37,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:07:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:07:38,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:07:39,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:07:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:07:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:07:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:07:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:07:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:07:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:07:44,009][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:07:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:07:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:07:45,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:07:46,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:07:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:07:47,957][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:07:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:07:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:07:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:07:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:07:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:07:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:07:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:07:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:07:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:07:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:07:55,195][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:07:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:07:56,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:07:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:07:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:07:58,487][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:07:59,145][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:07:59,803][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:08:00,462][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:08:01,120][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:08:01,778][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:08:02,436][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:08:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:08:03,754][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:08:04,412][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:08:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:08:05,729][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:08:06,387][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:08:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:08:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:08:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:08:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:08:09,993][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:08:10,651][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:08:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:08:11,967][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:08:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:08:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:08:13,941][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:08:14,600][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:08:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:08:15,918][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:08:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:08:17,234][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:08:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:08:18,552][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:08:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:08:19,870][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:08:20,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:08:21,240][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:08:22,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:08:22,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:08:22,359][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:08:23,704][__main__][INFO] - Iteration 50 took 51s (9.30% Gen, 88.09% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 34m 45s. Estimated total time: 14h 21m 44s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 52s. [2026-03-25 15:08:23,707][__main__][INFO] - Starting iteration 50. [2026-03-25 15:08:23,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:08:23,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:08:28,899][__main__][INFO] - Number of regex retries in iteration 50: 0 [2026-03-25 15:08:28,900][__main__][INFO] - agents played in iteration 50 are Alice, Bob [2026-03-25 15:08:29,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:08:29,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:08:29,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:08:29,555][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:08:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:08:30,830][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:08:31,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:08:32,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:08:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:08:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:08:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:08:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:08:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:08:36,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:08:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:08:37,413][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:08:38,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:08:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:08:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:08:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:08:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:08:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:08:42,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:08:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:08:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:08:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:08:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:08:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:08:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:08:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:08:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:08:47,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:08:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:08:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:08:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:08:50,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:08:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:08:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:08:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:08:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:08:53,897][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:08:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:08:55,217][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:08:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:08:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:08:57,194][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:08:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:08:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:08:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:08:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:09:00,490][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:09:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:09:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:09:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:09:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:09:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:09:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:09:05,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:09:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:09:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:09:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:09:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:09:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:09:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:09:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:09:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:09:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:09:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:09:12,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:09:13,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:09:14,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:09:14,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:09:14,495][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:09:17,342][__main__][INFO] - Iteration 51 took 53s (9.67% Gen, 85.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 5m 59s. Estimated total time: 14h 53m 52s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 23s, 500 more iterations: 7h 26m 56s. [2026-03-25 15:09:17,345][__main__][INFO] - Starting iteration 51. [2026-03-25 15:09:17,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:09:17,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:09:22,161][__main__][INFO] - Number of regex retries in iteration 51: 0 [2026-03-25 15:09:22,162][__main__][INFO] - agents played in iteration 51 are Alice, Bob [2026-03-25 15:09:22,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:09:22,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:09:22,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:09:22,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:09:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:09:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:09:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:09:25,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:09:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:09:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:09:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:09:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:09:28,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:09:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:09:30,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:09:30,659][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:09:31,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:09:31,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:09:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:09:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:09:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:09:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:09:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:09:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:09:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:09:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:09:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:09:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:09:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:09:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:09:40,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:09:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:09:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:09:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:09:43,190][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:09:43,849][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:09:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:09:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:09:45,827][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:09:46,486][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:09:47,145][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:09:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:09:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:09:49,124][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:09:49,783][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:09:50,443][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:09:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:09:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:09:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:09:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:09:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:09:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:09:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:09:56,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:09:56,665][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:09:57,324][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:09:57,983][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:09:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:09:59,301][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:09:59,960][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:10:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:10:01,277][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:10:01,937][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:10:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:10:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:10:03,913][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:10:04,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:10:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:10:05,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:10:06,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:10:07,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:10:07,825][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:10:07,826][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:10:09,131][__main__][INFO] - Iteration 52 took 51s (9.29% Gen, 88.18% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 34m 18s. Estimated total time: 14h 23m 3s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 31s. [2026-03-25 15:10:09,134][__main__][INFO] - Starting iteration 52. [2026-03-25 15:10:09,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:10:09,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:10:14,048][__main__][INFO] - Number of regex retries in iteration 52: 0 [2026-03-25 15:10:14,050][__main__][INFO] - agents played in iteration 52 are Alice, Bob [2026-03-25 15:10:14,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:10:14,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:10:14,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:10:14,697][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:10:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:10:15,966][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:10:16,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:10:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:10:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:10:18,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:10:19,262][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:10:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:10:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:10:21,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:10:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:10:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:10:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:10:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:10:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:10:25,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:10:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:10:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:10:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:10:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:10:28,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:10:29,142][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:10:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:10:30,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:10:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:10:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:10:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:10:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:10:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:10:34,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:10:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:10:35,732][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:10:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:10:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:10:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:10:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:10:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:10:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:10:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:10:41,004][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:10:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:10:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:10:42,981][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:10:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:10:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:10:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:10:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:10:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:10:47,245][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:10:47,904][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:10:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:10:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:10:49,881][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:10:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:10:51,201][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:10:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:10:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:10:53,179][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:10:53,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:10:54,497][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:10:55,157][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:10:55,816][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:10:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:10:57,134][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:10:57,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:10:58,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:10:59,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:10:59,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:10:59,626][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:11:01,089][__main__][INFO] - Iteration 53 took 51s (9.45% Gen, 87.72% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 36m 16s. Estimated total time: 14h 25m 53s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 56s. [2026-03-25 15:11:01,092][__main__][INFO] - Starting iteration 53. [2026-03-25 15:11:01,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:11:01,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:11:06,050][__main__][INFO] - Number of regex retries in iteration 53: 0 [2026-03-25 15:11:06,052][__main__][INFO] - agents played in iteration 53 are Alice, Bob [2026-03-25 15:11:06,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:06,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:06,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:11:06,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:11:07,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:11:07,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:11:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:11:09,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:11:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:11:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:11:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:11:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:11:12,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:11:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:11:13,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:11:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:11:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:11:15,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:11:16,512][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:11:17,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:11:17,829][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:11:18,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:11:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:11:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:11:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:11:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:11:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:11:22,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:11:23,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:11:23,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:11:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:11:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:11:25,731][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:11:26,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:11:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:11:27,709][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:11:28,367][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:11:29,027][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:11:29,686][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:11:30,345][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:11:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:11:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:11:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:11:32,979][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:11:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:11:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:11:34,956][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:11:35,615][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:11:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:11:36,933][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:11:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:11:38,250][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:11:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:11:39,811][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:11:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:11:41,129][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:11:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:11:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:11:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:11:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:11:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:11:45,094][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:11:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:11:46,417][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:11:47,077][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:11:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:11:48,399][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:11:49,060][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:11:49,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:11:50,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:11:51,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:11:51,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:11:51,649][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:11:52,971][__main__][INFO] - Iteration 54 took 51s (9.55% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 34m 8s. Estimated total time: 14h 24m 37s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 18s. [2026-03-25 15:11:52,974][__main__][INFO] - Starting iteration 54. [2026-03-25 15:11:52,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:11:52,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:11:57,897][__main__][INFO] - Number of regex retries in iteration 54: 0 [2026-03-25 15:11:57,899][__main__][INFO] - agents played in iteration 54 are Alice, Bob [2026-03-25 15:11:58,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:58,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:58,556][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:11:58,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:11:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:12:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:12:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:12:01,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:12:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:12:02,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:12:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:12:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:12:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:12:05,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:12:06,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:12:06,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:12:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:12:08,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:12:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:12:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:12:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:12:10,788][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:12:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:12:12,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:12:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:12:13,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:12:14,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:12:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:12:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:12:16,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:12:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:12:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:12:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:12:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:12:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:12:20,049][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:12:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:12:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:12:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:12:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:12:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:12:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:12:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:12:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:12:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:12:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:12:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:12:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:12:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:12:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:12:29,949][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:12:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:12:31,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:12:32,278][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:12:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:12:33,595][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:12:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:12:34,912][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:12:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:12:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:12:36,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:12:37,548][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:12:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:12:38,865][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:12:39,523][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:12:40,182][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:12:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:12:41,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:12:42,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:12:43,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:12:44,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:12:44,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:12:44,178][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:12:45,642][__main__][INFO] - Iteration 55 took 52s (9.34% Gen, 87.87% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 46m 24s. Estimated total time: 14h 37m 45s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 52s. [2026-03-25 15:12:45,645][__main__][INFO] - Starting iteration 55. [2026-03-25 15:12:45,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:12:45,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:12:50,536][__main__][INFO] - Number of regex retries in iteration 55: 0 [2026-03-25 15:12:50,538][__main__][INFO] - agents played in iteration 55 are Alice, Bob [2026-03-25 15:12:51,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:12:51,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:12:51,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:12:51,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:12:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:12:52,596][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:12:53,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:12:53,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:12:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:12:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:12:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:12:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:12:57,200][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:12:57,857][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:12:58,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:12:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:12:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:13:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:13:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:13:01,804][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:13:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:13:03,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:13:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:13:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:13:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:13:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:13:06,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:13:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:13:07,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:13:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:13:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:13:09,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:13:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:13:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:13:11,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:13:12,338][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:13:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:13:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:13:14,313][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:13:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:13:15,630][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:13:16,289][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:13:16,947][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:13:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:13:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:13:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:13:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:13:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:13:20,898][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:13:21,556][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:13:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:13:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:13:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:13:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:13:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:13:25,825][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:13:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:13:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:13:27,801][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:13:28,459][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:13:29,118][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:13:29,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:13:30,436][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:13:31,097][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:13:31,758][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:13:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:13:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:13:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:13:34,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:13:35,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:13:36,727][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:13:36,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:13:36,732][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:13:38,079][__main__][INFO] - Iteration 56 took 52s (9.32% Gen, 88.10% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 41m 37s. Estimated total time: 14h 33m 51s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 55s. [2026-03-25 15:13:38,081][__main__][INFO] - Starting iteration 56. [2026-03-25 15:13:38,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:13:38,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:13:43,036][__main__][INFO] - Number of regex retries in iteration 56: 0 [2026-03-25 15:13:43,037][__main__][INFO] - agents played in iteration 56 are Alice, Bob [2026-03-25 15:13:43,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:13:43,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:13:43,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:13:43,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:13:44,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:13:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:13:45,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:13:46,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:13:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:13:47,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:13:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:13:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:13:49,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:13:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:13:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:13:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:13:52,225][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:13:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:13:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:13:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:13:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:13:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:13:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:13:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:13:57,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:13:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:13:58,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:13:59,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:14:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:14:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:14:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:14:02,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:14:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:14:03,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:14:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:14:04,727][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:14:05,385][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:14:06,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:14:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:14:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:14:08,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:14:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:14:09,334][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:14:09,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:14:10,650][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:14:11,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:14:11,967][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:14:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:14:13,283][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:14:13,941][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:14:14,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:14:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:14:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:14:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:14:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:14:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:14:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:14:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:14:20,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:14:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:14:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:14:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:14:22,819][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:14:23,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:14:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:14:24,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:14:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:14:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:14:26,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:14:27,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:14:28,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:14:28,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:14:28,680][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:14:29,994][__main__][INFO] - Iteration 57 took 51s (9.54% Gen, 87.93% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 32m 4s. Estimated total time: 14h 25m 10s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 35s. [2026-03-25 15:14:29,996][__main__][INFO] - Starting iteration 57. [2026-03-25 15:14:30,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:14:30,001][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:14:34,844][__main__][INFO] - Number of regex retries in iteration 57: 0 [2026-03-25 15:14:34,845][__main__][INFO] - agents played in iteration 57 are Alice, Bob [2026-03-25 15:14:35,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:14:35,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:14:35,502][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:14:35,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:14:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:14:36,774][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:14:37,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:14:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:14:38,749][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:14:39,408][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:14:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:14:40,725][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:14:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:14:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:14:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:14:43,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:14:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:14:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:14:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:14:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:14:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:14:47,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:14:47,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:14:48,633][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:14:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:14:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:14:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:14:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:14:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:14:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:14:53,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:14:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:14:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:14:55,225][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:14:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:14:56,544][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:14:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:14:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:14:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:14:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:14:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:15:00,501][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:15:01,161][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:15:01,821][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:15:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:15:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:15:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:15:04,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:15:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:15:05,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:15:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:15:07,096][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:15:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:15:08,737][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:15:09,397][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:15:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:15:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:15:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:15:12,036][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:15:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:15:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:15:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:15:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:15:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:15:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:15:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:15:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:15:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:15:18,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:15:19,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:15:20,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:15:20,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:15:20,414][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:15:21,925][__main__][INFO] - Iteration 58 took 51s (9.33% Gen, 87.76% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 31m 28s. Estimated total time: 14h 25m 26s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 43s. [2026-03-25 15:15:21,928][__main__][INFO] - Starting iteration 58. [2026-03-25 15:15:21,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:15:21,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:15:27,089][__main__][INFO] - Number of regex retries in iteration 58: 0 [2026-03-25 15:15:27,091][__main__][INFO] - agents played in iteration 58 are Alice, Bob [2026-03-25 15:15:27,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:27,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:27,748][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:15:27,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:15:28,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:15:29,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:15:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:15:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:15:31,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:15:31,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:15:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:15:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:15:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:15:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:15:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:15:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:15:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:15:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:15:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:15:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:15:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:15:39,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:15:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:15:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:15:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:15:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:15:42,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:15:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:15:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:15:44,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:15:45,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:15:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:15:46,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:15:47,498][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:15:48,157][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:15:48,817][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:15:49,476][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:15:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:15:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:15:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:15:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:15:52,773][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:15:53,432][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:15:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:15:54,752][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:15:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:15:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:15:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:15:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:15:58,050][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:15:58,709][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:15:59,368][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:16:00,318][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:16:00,977][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:16:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:16:02,295][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:16:02,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:16:03,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:16:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:16:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:16:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:16:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:16:06,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:16:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:16:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:16:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:16:09,548][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:16:10,208][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:16:10,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:16:11,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:16:12,710][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:16:12,713][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:16:12,714][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:16:14,376][__main__][INFO] - Iteration 59 took 52s (9.83% Gen, 86.99% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 39m 15s. Estimated total time: 14h 34m 6s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 24s, 500 more iterations: 7h 17m 3s. [2026-03-25 15:16:14,379][__main__][INFO] - Starting iteration 59. [2026-03-25 15:16:14,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:16:14,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:16:19,287][__main__][INFO] - Number of regex retries in iteration 59: 0 [2026-03-25 15:16:19,289][__main__][INFO] - agents played in iteration 59 are Alice, Bob [2026-03-25 15:16:19,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:16:19,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:16:19,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:16:19,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:16:20,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:16:21,224][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:16:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:16:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:16:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:16:23,858][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:16:24,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:16:25,174][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:16:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:16:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:16:27,150][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:16:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:16:28,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:16:29,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:16:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:16:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:16:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:16:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:16:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:16:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:16:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:16:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:16:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:16:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:16:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:16:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:16:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:16:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:16:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:16:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:16:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:16:40,985][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:16:41,643][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:16:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:16:42,961][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:16:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:16:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:16:44,938][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:16:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:16:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:16:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:16:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:16:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:16:48,891][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:16:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:16:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:16:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:16:51,530][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:16:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:16:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:16:53,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:16:54,475][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:16:55,134][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:16:55,794][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:16:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:16:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:16:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:16:58,429][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:16:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:16:59,746][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:17:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:17:01,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:17:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:17:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:17:03,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:17:03,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:17:04,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:17:04,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:17:04,738][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:17:06,165][__main__][INFO] - Iteration 60 took 51s (9.47% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 27m 21s. Estimated total time: 14h 23m 4s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 32s. [2026-03-25 15:17:06,168][__main__][INFO] - Starting iteration 60. [2026-03-25 15:17:06,172][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:17:06,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:17:11,317][__main__][INFO] - Number of regex retries in iteration 60: 0 [2026-03-25 15:17:11,318][__main__][INFO] - agents played in iteration 60 are Alice, Bob [2026-03-25 15:17:11,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:17:11,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:17:11,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:17:11,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:17:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:17:13,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:17:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:17:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:17:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:17:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:17:16,530][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:17:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:17:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:17:18,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:17:19,165][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:17:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:17:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:17:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:17:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:17:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:17:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:17:23,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:17:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:17:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:17:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:17:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:17:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:17:27,739][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:17:28,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:17:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:17:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:17:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:17:31,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:17:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:17:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:17:33,013][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:17:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:17:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:17:34,988][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:17:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:17:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:17:36,965][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:17:37,623][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:17:38,282][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:17:38,940][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:17:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:17:40,259][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:17:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:17:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:17:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:17:42,892][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:17:43,550][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:17:44,497][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:17:45,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:17:45,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:17:46,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:17:47,129][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:17:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:17:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:17:49,106][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:17:49,764][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:17:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:17:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:17:51,740][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:17:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:17:53,058][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:17:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:17:54,376][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:17:55,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:17:55,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:17:56,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:17:56,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:17:56,892][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:17:58,263][__main__][INFO] - Iteration 61 took 52s (9.88% Gen, 87.49% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 31m 38s. Estimated total time: 14h 28m 13s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 6s. [2026-03-25 15:17:58,269][__main__][INFO] - Starting iteration 61. [2026-03-25 15:17:58,273][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:17:58,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:18:03,236][__main__][INFO] - Number of regex retries in iteration 61: 0 [2026-03-25 15:18:03,237][__main__][INFO] - agents played in iteration 61 are Alice, Bob [2026-03-25 15:18:03,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:03,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:03,940][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:18:03,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:18:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:18:05,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:18:05,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:18:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:18:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:18:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:18:08,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:18:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:18:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:18:10,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:18:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:18:11,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:18:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:18:13,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:18:13,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:18:14,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:18:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:18:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:18:16,398][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:18:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:18:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:18:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:18:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:18:19,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:18:20,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:18:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:18:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:18:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:18:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:18:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:18:24,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:18:24,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:18:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:18:26,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:18:26,936][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:18:27,595][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:18:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:18:28,914][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:18:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:18:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:18:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:18:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:18:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:18:32,917][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:18:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:18:34,238][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:18:34,900][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:18:35,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:18:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:18:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:18:38,071][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:18:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:18:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:18:40,060][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:18:40,721][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:18:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:18:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:18:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:18:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:18:44,041][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:18:44,703][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:18:45,368][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:18:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:18:46,688][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:18:47,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:18:48,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:18:49,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:18:49,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:18:49,484][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:18:50,731][__main__][INFO] - Iteration 62 took 52s (9.46% Gen, 88.16% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 36m 53s. Estimated total time: 14h 34m 20s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 10s. [2026-03-25 15:18:50,734][__main__][INFO] - Starting iteration 62. [2026-03-25 15:18:50,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:18:50,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:18:55,868][__main__][INFO] - Number of regex retries in iteration 62: 0 [2026-03-25 15:18:55,870][__main__][INFO] - agents played in iteration 62 are Alice, Bob [2026-03-25 15:18:56,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:56,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:56,607][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:18:56,607][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:18:57,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:18:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:18:58,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:18:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:19:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:19:00,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:19:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:19:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:19:02,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:19:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:19:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:19:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:19:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:19:06,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:19:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:19:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:19:08,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:19:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:19:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:19:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:19:10,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:19:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:19:12,015][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:19:12,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:19:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:19:13,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:19:14,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:19:15,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:19:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:19:16,636][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:19:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:19:17,958][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:19:18,618][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:19:19,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:19:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:19:20,599][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:19:21,259][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:19:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:19:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:19:23,241][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:19:23,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:19:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:19:25,220][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:19:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:19:26,540][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:19:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:19:27,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:19:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:19:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:19:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:19:30,883][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:19:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:19:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:19:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:19:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:19:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:19:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:19:35,503][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:19:36,164][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:19:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:19:37,485][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:19:38,146][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:19:38,805][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:19:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:19:40,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:19:40,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:19:41,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:19:41,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:19:41,997][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:19:43,533][__main__][INFO] - Iteration 63 took 52s (9.72% Gen, 87.37% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 41m 37s. Estimated total time: 14h 39m 56s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 58s. [2026-03-25 15:19:43,536][__main__][INFO] - Starting iteration 63. [2026-03-25 15:19:43,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:19:43,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:19:48,656][__main__][INFO] - Number of regex retries in iteration 63: 0 [2026-03-25 15:19:48,657][__main__][INFO] - agents played in iteration 63 are Alice, Bob [2026-03-25 15:19:49,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:19:49,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:19:49,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:19:49,490][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:19:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:19:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:19:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:19:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:19:52,750][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:19:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:19:54,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:19:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:19:55,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:19:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:19:56,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:19:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:19:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:19:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:19:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:20:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:20:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:20:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:20:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:20:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:20:03,302][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:20:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:20:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:20:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:20:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:20:06,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:20:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:20:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:20:08,580][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:20:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:20:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:20:10,559][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:20:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:20:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:20:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:20:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:20:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:20:14,519][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:20:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:20:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:20:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:20:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:20:17,819][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:20:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:20:19,140][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:20:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:20:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:20:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:20:22,103][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:20:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:20:23,420][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:20:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:20:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:20:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:20:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:20:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:20:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:20:28,038][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:20:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:20:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:20:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:20:30,676][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:20:31,336][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:20:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:20:32,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:20:33,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:20:34,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:20:34,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:20:34,563][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:20:35,987][__main__][INFO] - Iteration 64 took 52s (9.76% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 34m 57s. Estimated total time: 14h 34m 9s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 24s, 500 more iterations: 7h 17m 4s. [2026-03-25 15:20:35,990][__main__][INFO] - Starting iteration 64. [2026-03-25 15:20:35,996][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:20:35,997][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:20:40,831][__main__][INFO] - Number of regex retries in iteration 64: 0 [2026-03-25 15:20:40,833][__main__][INFO] - agents played in iteration 64 are Alice, Bob [2026-03-25 15:20:41,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:20:41,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:20:41,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:20:41,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:20:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:20:42,749][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:20:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:20:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:20:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:20:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:20:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:20:46,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:20:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:20:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:20:48,678][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:20:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:20:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:20:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:20:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:20:51,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:20:52,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:20:53,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:20:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:20:54,607][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:20:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:20:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:20:56,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:20:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:20:57,901][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:20:58,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:20:59,219][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:20:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:21:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:21:01,196][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:21:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:21:02,514][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:21:03,173][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:21:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:21:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:21:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:21:05,809][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:21:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:21:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:21:07,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:21:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:21:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:21:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:21:10,421][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:21:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:21:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:21:12,399][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:21:13,058][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:21:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:21:14,696][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:21:15,355][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:21:16,013][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:21:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:21:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:21:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:21:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:21:19,309][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:21:19,967][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:21:20,626][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:21:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:21:21,945][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:21:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:21:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:21:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:21:24,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:21:25,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:21:26,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:21:26,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:21:26,606][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:21:28,008][__main__][INFO] - Iteration 65 took 52s (9.30% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 26m 51s. Estimated total time: 14h 26m 55s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 27s. [2026-03-25 15:21:28,011][__main__][INFO] - Starting iteration 65. [2026-03-25 15:21:28,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:21:28,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:21:32,911][__main__][INFO] - Number of regex retries in iteration 65: 0 [2026-03-25 15:21:32,913][__main__][INFO] - agents played in iteration 65 are Alice, Bob [2026-03-25 15:21:33,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:21:33,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:21:33,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:21:33,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:21:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:21:34,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:21:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:21:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:21:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:21:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:21:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:21:38,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:21:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:21:40,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:21:40,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:21:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:21:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:21:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:21:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:21:44,130][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:21:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:21:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:21:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:21:46,766][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:21:47,426][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:21:48,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:21:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:21:49,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:21:50,065][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:21:50,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:21:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:21:52,041][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:21:52,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:21:53,359][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:21:54,019][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:21:54,678][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:21:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:21:55,998][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:21:56,657][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:21:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:21:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:21:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:21:59,292][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:21:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:22:00,611][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:22:01,270][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:22:01,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:22:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:22:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:22:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:22:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:22:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:22:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:22:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:22:07,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:22:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:22:08,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:22:09,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:22:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:22:10,774][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:22:11,433][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:22:12,093][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:22:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:22:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:22:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:22:14,727][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:22:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:22:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:22:16,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:22:17,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:22:19,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:22:19,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:22:19,404][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:22:20,907][__main__][INFO] - Iteration 66 took 52s (9.26% Gen, 87.90% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 40m 37s. Estimated total time: 14h 41m 33s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 9s, 500 more iterations: 7h 20m 46s. [2026-03-25 15:22:20,909][__main__][INFO] - Starting iteration 66. [2026-03-25 15:22:20,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:22:20,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:22:26,767][__main__][INFO] - Number of regex retries in iteration 66: 0 [2026-03-25 15:22:26,769][__main__][INFO] - agents played in iteration 66 are Alice, Bob [2026-03-25 15:22:27,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:22:27,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:22:27,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:22:27,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:22:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:22:28,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:22:29,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:22:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:22:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:22:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:22:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:22:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:22:33,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:22:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:22:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:22:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:22:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:22:36,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:22:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:22:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:22:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:22:39,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:22:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:22:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:22:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:22:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:22:42,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:22:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:22:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:22:44,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:22:45,334][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:22:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:22:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:22:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:22:47,971][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:22:48,630][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:22:49,288][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:22:49,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:22:50,606][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:22:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:22:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:22:52,584][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:22:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:22:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:22:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:22:55,218][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:22:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:22:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:22:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:22:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:22:58,513][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:22:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:23:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:23:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:23:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:23:02,096][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:23:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:23:03,415][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:23:04,074][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:23:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:23:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:23:06,052][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:23:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:23:07,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:23:08,029][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:23:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:23:09,348][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:23:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:23:10,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:23:11,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:23:12,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:23:12,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:23:12,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:23:14,195][__main__][INFO] - Iteration 67 took 53s (10.99% Gen, 86.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 46m 13s. Estimated total time: 14h 48m 3s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 1s. [2026-03-25 15:23:14,198][__main__][INFO] - Starting iteration 67. [2026-03-25 15:23:14,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:23:14,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:23:19,835][__main__][INFO] - Number of regex retries in iteration 67: 0 [2026-03-25 15:23:19,837][__main__][INFO] - agents played in iteration 67 are Alice, Bob [2026-03-25 15:23:20,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:20,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:20,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:23:20,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:23:21,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:23:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:23:22,628][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:23:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:23:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:23:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:23:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:23:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:23:26,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:23:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:23:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:23:28,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:23:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:23:29,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:23:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:23:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:23:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:23:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:23:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:23:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:23:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:23:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:23:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:23:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:23:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:23:37,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:23:38,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:23:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:23:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:23:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:23:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:23:41,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:23:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:23:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:23:43,699][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:23:44,357][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:23:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:23:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:23:46,333][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:23:46,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:23:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:23:48,311][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:23:48,970][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:23:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:23:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:23:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:23:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:23:52,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:23:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:23:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:23:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:23:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:23:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:23:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:23:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:23:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:23:58,506][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:23:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:23:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:24:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:24:01,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:24:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:24:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:24:03,118][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:24:03,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:24:04,469][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:24:05,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:24:05,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:24:05,598][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:24:07,020][__main__][INFO] - Iteration 68 took 52s (10.67% Gen, 86.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 37m 37s. Estimated total time: 14h 40m 20s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 10s. [2026-03-25 15:24:07,023][__main__][INFO] - Starting iteration 68. [2026-03-25 15:24:07,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:24:07,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:24:11,910][__main__][INFO] - Number of regex retries in iteration 68: 0 [2026-03-25 15:24:11,912][__main__][INFO] - agents played in iteration 68 are Alice, Bob [2026-03-25 15:24:12,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:24:12,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:24:12,718][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:24:12,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:24:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:24:14,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:24:14,760][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:24:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:24:16,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:24:16,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:24:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:24:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:24:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:24:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:24:20,041][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:24:20,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:24:21,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:24:22,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:24:22,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:24:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:24:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:24:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:24:25,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:24:25,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:24:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:24:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:24:27,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:24:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:24:29,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:24:29,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:24:30,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:24:31,268][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:24:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:24:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:24:33,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:24:33,910][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:24:34,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:24:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:24:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:24:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:24:37,217][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:24:37,879][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:24:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:24:39,202][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:24:39,864][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:24:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:24:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:24:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:24:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:24:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:24:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:24:44,488][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:24:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:24:46,134][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:24:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:24:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:24:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:24:48,776][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:24:49,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:24:50,096][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:24:50,757][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:24:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:24:52,078][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:24:52,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:24:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:24:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:24:54,721][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:24:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:24:56,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:24:56,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:24:57,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:24:57,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:24:57,814][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:24:59,230][__main__][INFO] - Iteration 69 took 52s (9.35% Gen, 87.93% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 26m 29s. Estimated total time: 14h 30m 4s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 2s. [2026-03-25 15:24:59,232][__main__][INFO] - Starting iteration 69. [2026-03-25 15:24:59,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:24:59,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:25:04,824][__main__][INFO] - Number of regex retries in iteration 69: 0 [2026-03-25 15:25:04,826][__main__][INFO] - agents played in iteration 69 are Alice, Bob [2026-03-25 15:25:05,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:05,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:05,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:25:05,573][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:25:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:25:06,867][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:25:07,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:25:08,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:25:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:25:09,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:25:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:25:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:25:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:25:12,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:25:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:25:13,473][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:25:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:25:14,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:25:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:25:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:25:16,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:25:17,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:25:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:25:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:25:19,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:25:20,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:25:20,734][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:25:21,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:25:22,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:25:22,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:25:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:25:24,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:25:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:25:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:25:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:25:26,675][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:25:27,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:25:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:25:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:25:29,314][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:25:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:25:30,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:25:31,295][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:25:31,956][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:25:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:25:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:25:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:25:34,598][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:25:35,258][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:25:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:25:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:25:37,239][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:25:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:25:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:25:39,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:25:40,175][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:25:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:25:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:25:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:25:42,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:25:43,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:25:44,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:25:44,794][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:25:45,455][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:25:46,114][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:25:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:25:47,434][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:25:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:25:48,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:25:49,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:25:50,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:25:50,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:25:50,569][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:25:52,140][__main__][INFO] - Iteration 70 took 52s (10.56% Gen, 86.46% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 37m 18s. Estimated total time: 14h 41m 46s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 53s. [2026-03-25 15:25:52,144][__main__][INFO] - Starting iteration 70. [2026-03-25 15:25:52,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:25:52,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:25:57,578][__main__][INFO] - Number of regex retries in iteration 70: 0 [2026-03-25 15:25:57,579][__main__][INFO] - agents played in iteration 70 are Alice, Bob [2026-03-25 15:25:58,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:58,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:58,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:25:58,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:25:59,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:25:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:26:00,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:26:00,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:26:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:26:02,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:26:02,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:26:03,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:26:04,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:26:04,937][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:26:05,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:26:06,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:26:06,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:26:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:26:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:26:08,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:26:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:26:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:26:10,872][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:26:11,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:26:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:26:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:26:13,509][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:26:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:26:14,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:26:15,487][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:26:16,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:26:16,806][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:26:17,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:26:18,125][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:26:18,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:26:19,444][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:26:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:26:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:26:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:26:22,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:26:22,741][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:26:23,400][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:26:24,060][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:26:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:26:25,377][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:26:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:26:26,695][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:26:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:26:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:26:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:26:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:26:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:26:30,880][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:26:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:26:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:26:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:26:33,520][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:26:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:26:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:26:35,498][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:26:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:26:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:26:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:26:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:26:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:26:39,454][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:26:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:26:40,773][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:26:41,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:26:42,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:26:43,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:26:43,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:26:43,850][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:26:45,273][__main__][INFO] - Iteration 71 took 53s (10.22% Gen, 87.09% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 40m 6s. Estimated total time: 14h 45m 27s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 32s, 500 more iterations: 7h 22m 43s. [2026-03-25 15:26:45,278][__main__][INFO] - Starting iteration 71. [2026-03-25 15:26:45,282][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:26:45,283][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:26:50,406][__main__][INFO] - Number of regex retries in iteration 71: 0 [2026-03-25 15:26:50,408][__main__][INFO] - agents played in iteration 71 are Alice, Bob [2026-03-25 15:26:51,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:26:51,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:26:51,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:26:51,209][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:26:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:26:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:26:53,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:26:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:26:54,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:26:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:26:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:26:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:26:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:26:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:26:58,591][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:26:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:26:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:27:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:27:01,224][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:27:01,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:27:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:27:03,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:27:03,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:27:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:27:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:27:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:27:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:27:07,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:27:07,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:27:08,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:27:09,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:27:09,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:27:10,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:27:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:27:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:27:12,438][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:27:13,098][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:27:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:27:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:27:15,077][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:27:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:27:16,396][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:27:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:27:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:27:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:27:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:27:19,697][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:27:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:27:21,017][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:27:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:27:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:27:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:27:24,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:27:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:27:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:27:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:27:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:27:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:27:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:27:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:27:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:27:30,024][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:27:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:27:31,348][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:27:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:27:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:27:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:27:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:27:34,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:27:35,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:27:36,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:27:36,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:27:36,595][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:27:38,067][__main__][INFO] - Iteration 72 took 52s (9.71% Gen, 87.50% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 33m 32s. Estimated total time: 14h 39m 46s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 58s, 500 more iterations: 7h 19m 53s. [2026-03-25 15:27:38,072][__main__][INFO] - Starting iteration 72. [2026-03-25 15:27:38,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:27:38,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:27:43,309][__main__][INFO] - Number of regex retries in iteration 72: 0 [2026-03-25 15:27:43,311][__main__][INFO] - agents played in iteration 72 are Alice, Bob [2026-03-25 15:27:43,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:27:44,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:27:44,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:27:44,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:27:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:27:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:27:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:27:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:27:47,493][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:27:48,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:27:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:27:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:27:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:27:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:27:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:27:52,123][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:27:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:27:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:27:54,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:27:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:27:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:27:56,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:27:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:27:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:27:58,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:27:58,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:27:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:28:00,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:28:00,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:28:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:28:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:28:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:28:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:28:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:28:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:28:05,323][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:28:05,983][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:28:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:28:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:28:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:28:08,620][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:28:09,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:28:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:28:10,598][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:28:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:28:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:28:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:28:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:28:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:28:14,557][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:28:15,217][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:28:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:28:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:28:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:28:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:28:18,840][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:28:19,499][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:28:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:28:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:28:21,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:28:22,137][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:28:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:28:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:28:24,115][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:28:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:28:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:28:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:28:26,755][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:28:27,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:28:28,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:28:29,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:28:29,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:28:29,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:28:30,716][__main__][INFO] - Iteration 73 took 52s (9.92% Gen, 87.42% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 30m 0s. Estimated total time: 14h 37m 7s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 33s. [2026-03-25 15:28:30,720][__main__][INFO] - Starting iteration 73. [2026-03-25 15:28:30,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:28:30,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:28:35,991][__main__][INFO] - Number of regex retries in iteration 73: 0 [2026-03-25 15:28:35,993][__main__][INFO] - agents played in iteration 73 are Alice, Bob [2026-03-25 15:28:36,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:28:36,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:28:36,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:28:36,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:28:37,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:28:38,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:28:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:28:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:28:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:28:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:28:41,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:28:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:28:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:28:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:28:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:28:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:28:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:28:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:28:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:28:47,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:28:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:28:48,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:28:49,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:28:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:28:50,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:28:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:28:51,940][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:28:52,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:28:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:28:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:28:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:28:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:28:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:28:56,550][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:28:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:28:57,868][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:28:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:28:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:28:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:29:00,504][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:29:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:29:01,822][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:29:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:29:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:29:03,797][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:29:04,457][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:29:05,116][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:29:05,775][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:29:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:29:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:29:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:29:08,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:29:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:29:10,050][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:29:10,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:29:11,367][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:29:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:29:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:29:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:29:14,003][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:29:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:29:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:29:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:29:16,641][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:29:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:29:17,960][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:29:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:29:19,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:29:19,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:29:20,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:29:21,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:29:21,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:29:21,742][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:29:23,363][__main__][INFO] - Iteration 74 took 52s (9.99% Gen, 86.92% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 29m 13s. Estimated total time: 14h 37m 12s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 36s. [2026-03-25 15:29:23,366][__main__][INFO] - Starting iteration 74. [2026-03-25 15:29:23,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:29:23,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:29:30,922][__main__][INFO] - Number of regex retries in iteration 74: 0 [2026-03-25 15:29:30,924][__main__][INFO] - agents played in iteration 74 are Alice, Bob [2026-03-25 15:29:31,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:29:31,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:29:31,592][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:29:31,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:29:32,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:29:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:29:33,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:29:34,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:29:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:29:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:29:36,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:29:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:29:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:29:38,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:29:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:29:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:29:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:29:40,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:29:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:29:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:29:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:29:43,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:29:44,200][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:29:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:29:45,520][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:29:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:29:46,841][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:29:47,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:29:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:29:48,824][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:29:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:29:50,147][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:29:50,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:29:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:29:52,129][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:29:52,789][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:29:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:29:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:29:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:29:55,433][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:29:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:29:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:29:57,411][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:29:58,071][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:29:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:29:59,390][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:30:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:30:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:30:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:30:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:30:02,691][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:30:03,350][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:30:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:30:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:30:05,661][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:30:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:30:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:30:07,640][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:30:08,301][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:30:08,961][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:30:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:30:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:30:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:30:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:30:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:30:12,920][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:30:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:30:14,239][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:30:14,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:30:15,749][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:30:16,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:30:16,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:30:16,887][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:30:18,206][__main__][INFO] - Iteration 75 took 54s (13.77% Gen, 83.82% Train). Generation: 7s, Training: 45s. Estimated remaining time: 14h 5m 4s. Estimated total time: 15h 13m 58s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 59s. [2026-03-25 15:30:18,215][__main__][INFO] - Starting iteration 75. [2026-03-25 15:30:18,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:30:18,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:30:23,644][__main__][INFO] - Number of regex retries in iteration 75: 0 [2026-03-25 15:30:23,646][__main__][INFO] - agents played in iteration 75 are Alice, Bob [2026-03-25 15:30:24,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:30:24,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:30:24,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:30:24,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:30:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:30:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:30:26,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:30:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:30:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:30:28,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:30:28,965][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:30:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:30:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:30:30,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:30:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:30:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:30:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:30:33,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:30:34,240][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:30:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:30:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:30:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:30:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:30:37,536][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:30:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:30:38,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:30:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:30:40,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:30:40,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:30:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:30:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:30:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:30:43,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:30:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:30:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:30:45,451][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:30:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:30:46,771][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:30:47,429][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:30:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:30:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:30:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:30:50,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:30:50,726][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:30:51,386][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:30:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:30:52,704][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:30:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:30:54,023][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:30:54,683][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:30:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:30:56,001][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:30:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:30:57,648][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:30:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:30:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:30:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:31:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:31:00,945][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:31:01,604][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:31:02,264][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:31:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:31:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:31:04,242][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:31:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:31:05,561][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:31:06,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:31:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:31:07,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:31:08,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:31:09,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:31:09,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:31:09,407][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:31:10,806][__main__][INFO] - Iteration 76 took 52s (10.27% Gen, 87.06% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 26m 14s. Estimated total time: 14h 36m 1s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 0s. [2026-03-25 15:31:10,809][__main__][INFO] - Starting iteration 76. [2026-03-25 15:31:10,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:31:10,814][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:31:18,113][__main__][INFO] - Number of regex retries in iteration 76: 0 [2026-03-25 15:31:18,114][__main__][INFO] - agents played in iteration 76 are Alice, Bob [2026-03-25 15:31:18,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:31:18,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:31:18,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:31:18,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:31:19,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:31:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:31:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:31:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:31:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:31:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:31:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:31:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:31:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:31:25,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:31:25,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:31:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:31:27,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:31:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:31:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:31:29,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:31:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:31:30,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:31:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:31:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:31:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:31:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:31:33,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:31:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:31:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:31:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:31:36,536][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:31:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:31:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:31:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:31:39,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:31:39,833][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:31:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:31:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:31:41,811][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:31:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:31:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:31:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:31:44,448][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:31:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:31:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:31:46,428][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:31:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:31:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:31:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:31:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:31:49,725][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:31:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:31:51,321][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:31:51,979][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:31:52,638][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:31:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:31:53,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:31:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:31:55,275][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:31:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:31:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:31:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:31:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:31:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:31:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:31:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:32:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:32:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:32:01,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:32:02,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:32:03,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:32:03,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:32:03,939][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:32:05,565][__main__][INFO] - Iteration 77 took 54s (13.33% Gen, 83.69% Train). Generation: 7s, Training: 45s. Estimated remaining time: 14h 1m 51s. Estimated total time: 15h 12m 32s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 16s. [2026-03-25 15:32:05,570][__main__][INFO] - Starting iteration 77. [2026-03-25 15:32:05,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:32:05,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:32:10,840][__main__][INFO] - Number of regex retries in iteration 77: 0 [2026-03-25 15:32:10,842][__main__][INFO] - agents played in iteration 77 are Alice, Bob [2026-03-25 15:32:11,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:11,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:11,502][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:32:11,502][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:32:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:32:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:32:13,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:32:14,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:32:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:32:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:32:16,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:32:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:32:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:32:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:32:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:32:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:32:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:32:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:32:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:32:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:32:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:32:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:32:24,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:32:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:32:25,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:32:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:32:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:32:27,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:32:28,104][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:32:28,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:32:29,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:32:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:32:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:32:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:32:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:32:32,717][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:32:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:32:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:32:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:32:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:32:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:32:36,675][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:32:37,334][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:32:37,994][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:32:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:32:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:32:39,971][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:32:40,631][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:32:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:32:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:32:42,611][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:32:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:32:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:32:44,913][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:32:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:32:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:32:46,890][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:32:47,549][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:32:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:32:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:32:49,537][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:32:50,197][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:32:50,856][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:32:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:32:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:32:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:32:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:32:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:32:54,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:32:55,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:32:56,859][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:32:56,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:32:56,866][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:33:01,617][__main__][INFO] - Iteration 78 took 56s (9.40% Gen, 82.12% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 22m 25s. Estimated total time: 15h 34m 3s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 24s, 500 more iterations: 7h 47m 1s. [2026-03-25 15:33:01,619][__main__][INFO] - Starting iteration 78. [2026-03-25 15:33:01,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:33:01,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:33:06,719][__main__][INFO] - Number of regex retries in iteration 78: 0 [2026-03-25 15:33:06,720][__main__][INFO] - agents played in iteration 78 are Alice, Bob [2026-03-25 15:33:07,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:07,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:07,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:33:07,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:33:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:33:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:33:09,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:33:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:33:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:33:11,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:33:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:33:12,642][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:33:13,301][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:33:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:33:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:33:15,276][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:33:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:33:16,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:33:17,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:33:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:33:18,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:33:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:33:19,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:33:20,548][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:33:21,208][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:33:21,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:33:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:33:23,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:33:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:33:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:33:25,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:33:25,820][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:33:26,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:33:27,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:33:27,797][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:33:28,456][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:33:29,115][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:33:29,774][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:33:30,433][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:33:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:33:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:33:32,411][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:33:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:33:33,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:33:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:33:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:33:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:33:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:33:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:33:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:33:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:33:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:33:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:33:40,649][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:33:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:33:41,967][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:33:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:33:43,287][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:33:43,949][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:33:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:33:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:33:45,928][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:33:46,589][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:33:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:33:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:33:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:33:49,228][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:33:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:33:50,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:33:51,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:33:52,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:33:52,443][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:33:52,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:33:53,771][__main__][INFO] - Iteration 79 took 52s (9.77% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 16m 39s. Estimated total time: 14h 29m 9s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 34s. [2026-03-25 15:33:53,774][__main__][INFO] - Starting iteration 79. [2026-03-25 15:33:53,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:33:53,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:33:58,732][__main__][INFO] - Number of regex retries in iteration 79: 0 [2026-03-25 15:33:58,734][__main__][INFO] - agents played in iteration 79 are Alice, Bob [2026-03-25 15:33:59,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:59,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:59,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:33:59,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:34:00,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:34:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:34:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:34:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:34:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:34:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:34:04,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:34:04,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:34:05,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:34:06,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:34:06,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:34:07,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:34:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:34:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:34:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:34:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:34:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:34:11,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:34:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:34:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:34:13,279][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:34:13,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:34:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:34:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:34:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:34:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:34:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:34:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:34:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:34:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:34:19,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:34:20,530][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:34:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:34:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:34:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:34:23,166][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:34:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:34:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:34:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:34:25,801][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:34:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:34:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:34:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:34:28,440][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:34:29,099][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:34:29,758][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:34:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:34:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:34:32,064][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:34:32,723][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:34:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:34:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:34:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:34:35,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:34:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:34:36,678][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:34:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:34:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:34:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:34:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:34:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:34:40,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:34:41,295][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:34:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:34:42,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:34:43,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:34:44,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:34:44,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:34:44,484][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:34:45,903][__main__][INFO] - Iteration 80 took 52s (9.51% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 15m 25s. Estimated total time: 14h 28m 46s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 23s. [2026-03-25 15:34:45,906][__main__][INFO] - Starting iteration 80. [2026-03-25 15:34:45,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:34:45,910][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:34:51,027][__main__][INFO] - Number of regex retries in iteration 80: 0 [2026-03-25 15:34:51,028][__main__][INFO] - agents played in iteration 80 are Alice, Bob [2026-03-25 15:34:51,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:34:51,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:34:51,696][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:34:51,696][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:34:52,361][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:34:52,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:34:53,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:34:54,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:34:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:34:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:34:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:34:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:34:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:34:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:34:58,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:34:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:35:00,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:35:00,887][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:35:01,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:35:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:35:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:35:03,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:35:04,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:35:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:35:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:35:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:35:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:35:07,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:35:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:35:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:35:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:35:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:35:10,771][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:35:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:35:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:35:12,748][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:35:13,407][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:35:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:35:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:35:15,384][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:35:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:35:16,701][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:35:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:35:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:35:18,677][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:35:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:35:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:35:20,654][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:35:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:35:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:35:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:35:23,290][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:35:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:35:24,933][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:35:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:35:26,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:35:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:35:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:35:28,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:35:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:35:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:35:30,205][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:35:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:35:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:35:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:35:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:35:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:35:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:35:34,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:35:35,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:35:36,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:35:36,845][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:35:36,847][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:35:38,180][__main__][INFO] - Iteration 81 took 52s (9.79% Gen, 87.65% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 16m 58s. Estimated total time: 14h 31m 12s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 36s. [2026-03-25 15:35:38,185][__main__][INFO] - Starting iteration 81. [2026-03-25 15:35:38,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:35:38,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:35:43,476][__main__][INFO] - Number of regex retries in iteration 81: 0 [2026-03-25 15:35:43,478][__main__][INFO] - agents played in iteration 81 are Alice, Bob [2026-03-25 15:35:44,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:35:44,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:35:44,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:35:44,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:35:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:35:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:35:46,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:35:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:35:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:35:48,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:35:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:35:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:35:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:35:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:35:51,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:35:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:35:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:35:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:35:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:35:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:35:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:35:56,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:35:56,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:35:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:35:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:35:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:35:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:36:00,166][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:36:00,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:36:01,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:36:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:36:02,811][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:36:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:36:04,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:36:04,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:36:05,454][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:36:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:36:06,777][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:36:07,440][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:36:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:36:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:36:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:36:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:36:10,748][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:36:11,408][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:36:12,069][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:36:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:36:13,392][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:36:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:36:14,715][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:36:15,377][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:36:16,038][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:36:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:36:17,685][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:36:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:36:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:36:19,662][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:36:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:36:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:36:21,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:36:22,297][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:36:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:36:23,615][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:36:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:36:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:36:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:36:26,249][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:36:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:36:27,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:36:28,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:36:29,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:36:29,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:36:29,456][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:36:30,859][__main__][INFO] - Iteration 82 took 52s (10.04% Gen, 87.29% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 22m 44s. Estimated total time: 14h 37m 51s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 47s, 500 more iterations: 7h 18m 55s. [2026-03-25 15:36:30,862][__main__][INFO] - Starting iteration 82. [2026-03-25 15:36:30,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:36:30,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:36:38,330][__main__][INFO] - Number of regex retries in iteration 82: 0 [2026-03-25 15:36:38,332][__main__][INFO] - agents played in iteration 82 are Alice, Bob [2026-03-25 15:36:38,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:36:39,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:36:39,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:36:39,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:36:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:36:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:36:40,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:36:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:36:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:36:42,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:36:43,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:36:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:36:44,938][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:36:45,599][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:36:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:36:46,916][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:36:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:36:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:36:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:36:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:36:50,207][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:36:50,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:36:51,524][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:36:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:36:52,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:36:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:36:54,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:36:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:36:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:36:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:36:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:36:57,454][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:36:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:36:58,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:36:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:37:00,095][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:37:00,754][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:37:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:37:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:37:02,732][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:37:03,393][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:37:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:37:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:37:05,374][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:37:06,033][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:37:06,693][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:37:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:37:08,011][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:37:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:37:09,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:37:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:37:10,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:37:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:37:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:37:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:37:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:37:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:37:15,236][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:37:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:37:16,559][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:37:17,220][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:37:17,882][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:37:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:37:19,203][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:37:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:37:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:37:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:37:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:37:22,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:37:23,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:37:24,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:37:24,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:37:24,486][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:37:25,762][__main__][INFO] - Iteration 83 took 54s (13.60% Gen, 84.07% Train). Generation: 7s, Training: 46s. Estimated remaining time: 13h 58m 55s. Estimated total time: 15h 14m 57s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 28s. [2026-03-25 15:37:25,766][__main__][INFO] - Starting iteration 83. [2026-03-25 15:37:25,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:37:25,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:37:31,101][__main__][INFO] - Number of regex retries in iteration 83: 0 [2026-03-25 15:37:31,102][__main__][INFO] - agents played in iteration 83 are Alice, Bob [2026-03-25 15:37:31,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:37:31,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:37:31,812][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:37:31,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:37:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:37:33,206][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:37:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:37:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:37:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:37:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:37:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:37:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:37:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:37:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:37:39,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:37:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:37:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:37:41,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:37:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:37:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:37:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:37:43,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:37:44,426][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:37:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:37:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:37:46,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:37:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:37:47,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:37:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:37:49,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:37:49,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:37:50,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:37:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:37:51,686][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:37:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:37:53,007][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:37:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:37:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:37:54,986][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:37:55,646][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:37:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:37:56,967][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:37:57,627][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:37:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:37:58,947][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:37:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:02,906][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:04,578][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:05,897][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:38:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:38:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:38:08,535][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:38:09,195][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:38:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:12,492][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:13,812][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:14,471][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:15,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:38:15,925][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:38:16,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:16,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:16,988][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:18,459][__main__][INFO] - Iteration 84 took 52s (10.12% Gen, 87.08% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 21m 16s. Estimated total time: 14h 38m 11s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 49s, 500 more iterations: 7h 19m 5s. [2026-03-25 15:38:18,462][__main__][INFO] - Starting iteration 84. [2026-03-25 15:38:18,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:38:18,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:38:23,436][__main__][INFO] - Number of regex retries in iteration 84: 0 [2026-03-25 15:38:23,437][__main__][INFO] - agents played in iteration 84 are Alice, Bob [2026-03-25 15:38:24,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:38:24,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:38:24,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:38:24,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:38:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:38:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:38:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:38:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:38:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:38:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:38:28,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:38:29,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:38:30,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:38:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:38:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:38:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:38:32,663][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:38:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:38:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:38:34,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:38:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:38:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:38:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:38:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:38:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:38:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:38:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:38:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:38:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:38:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:38:41,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:38:42,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:38:43,202][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:38:43,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:38:44,520][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:38:45,180][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:38:45,839][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:38:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:38:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:38:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:38:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:38:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:38:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:38:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:38:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:38:51,768][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:56,711][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:57,371][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:58,029][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:58,688][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:38:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:00,006][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:02,642][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:03,300][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:03,959][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:04,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:05,934][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:07,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:39:07,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:39:09,095][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:09,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:09,099][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:10,514][__main__][INFO] - Iteration 85 took 52s (9.55% Gen, 87.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 9m 42s. Estimated total time: 14h 27m 29s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 44s. [2026-03-25 15:39:10,517][__main__][INFO] - Starting iteration 85. [2026-03-25 15:39:10,520][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:39:10,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:15,524][__main__][INFO] - Number of regex retries in iteration 85: 0 [2026-03-25 15:39:15,526][__main__][INFO] - agents played in iteration 85 are Alice, Bob [2026-03-25 15:39:16,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:39:16,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:39:16,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:39:16,204][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:39:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:19,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:20,783][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:21,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:22,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:22,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:39:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:39:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:39:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:39:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:39:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:39:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:39:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:39:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:39:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:39:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:39:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:39:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:39:31,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:39:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:39:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:39:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:39:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:39:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:39:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:39:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:39:37,246][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:39:37,904][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:39:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:39:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:39:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:39:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:39:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:39:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:39:42,514][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:39:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:39:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:39:44,492][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:39:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:39:45,811][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:39:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:39:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:39:47,786][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:39:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:39:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:39:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:39:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:39:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:54,036][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:54,695][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:55,354][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:56,012][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:56,670][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:57,988][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:59,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:40:00,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:40:01,062][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:01,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:01,068][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:02,535][__main__][INFO] - Iteration 86 took 52s (9.58% Gen, 87.55% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 8m 17s. Estimated total time: 14h 26m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 28s. [2026-03-25 15:40:02,538][__main__][INFO] - Starting iteration 86. [2026-03-25 15:40:02,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:40:02,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:07,484][__main__][INFO] - Number of regex retries in iteration 86: 0 [2026-03-25 15:40:07,486][__main__][INFO] - agents played in iteration 86 are Alice, Bob [2026-03-25 15:40:08,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:08,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:08,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:40:08,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:40:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:10,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:11,361][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:40:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:40:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:40:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:40:13,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:40:14,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:40:15,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:40:15,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:17,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:18,606][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:19,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:22,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:23,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:40:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:40:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:40:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:40:27,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:40:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:40:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:40:29,144][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:40:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:40:30,460][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:40:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:40:31,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:40:32,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:40:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:40:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:40:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:40:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:40:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:40:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:40:37,045][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:40:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:40:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:40:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:40:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:40:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:40:41,319][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:40:41,977][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:40:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:40:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:40:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:40:44,613][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:40:45,272][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:40:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:40:46,589][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:40:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:40:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:40:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:40:49,227][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:40:49,885][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:40:50,543][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:40:51,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:40:52,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:40:53,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:53,059][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:53,060][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:54,428][__main__][INFO] - Iteration 87 took 51s (9.53% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 5m 17s. Estimated total time: 14h 24m 47s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 23s. [2026-03-25 15:40:54,431][__main__][INFO] - Starting iteration 87. [2026-03-25 15:40:54,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:40:54,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:59,485][__main__][INFO] - Number of regex retries in iteration 87: 0 [2026-03-25 15:40:59,487][__main__][INFO] - agents played in iteration 87 are Alice, Bob [2026-03-25 15:41:00,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:00,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:00,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:41:00,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:41:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:01,548][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:02,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:03,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:04,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:04,837][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:41:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:41:08,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:41:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:41:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:41:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:41:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:41:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:41:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:41:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:41:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:41:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:41:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:41:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:41:16,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:19,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:20,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:21,286][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:41:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:23,920][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:25,893][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:41:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:41:28,525][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:41:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:41:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:41:30,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:41:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:41:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:41:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:41:33,398][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:41:34,055][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:41:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:41:35,372][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:41:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:41:36,689][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:41:37,347][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:41:38,006][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:41:38,664][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:41:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:41:39,982][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:41:40,641][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:41:41,299][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:41:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:41:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:41:43,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:41:43,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:41:45,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:41:45,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:41:45,082][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:41:46,379][__main__][INFO] - Iteration 88 took 51s (9.72% Gen, 87.78% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 5m 23s. Estimated total time: 14h 25m 45s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 52s. [2026-03-25 15:41:46,382][__main__][INFO] - Starting iteration 88. [2026-03-25 15:41:46,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:41:46,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:41:48,226][mllm.models.large_language_model_local][WARNING] - Response %A did not match regex: (|), retry 1/1 [2026-03-25 15:41:52,369][__main__][INFO] - Number of regex retries in iteration 88: 1 [2026-03-25 15:41:52,370][__main__][INFO] - agents played in iteration 88 are Alice, Bob [2026-03-25 15:41:52,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:53,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:53,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:41:53,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 15:41:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:54,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:55,618][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:58,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:59,565][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:42:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:05,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:06,802][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 15:42:07,459][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:42:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:42:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:42:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:42:10,750][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:42:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:42:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:42:12,725][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:42:13,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:42:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:42:14,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:42:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:42:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:42:16,675][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:42:17,334][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:42:17,993][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:42:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:42:19,309][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:42:19,967][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
15:42:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:21,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:23,257][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:25,486][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:26,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:27,459][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:42:28,118][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:42:30,750][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:42:31,409][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:42:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:42:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:42:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:42:34,044][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 15:42:34,702][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:42:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:42:36,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:42:36,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:42:37,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:42:37,816][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:42:37,818][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:42:39,082][__main__][INFO] - Iteration 89 took 52s (11.36% Gen, 86.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 17m 2s. Estimated total time: 14h 38m 17s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 49s, 500 more iterations: 7h 19m 8s. [2026-03-25 15:42:39,084][__main__][INFO] - Starting iteration 89. [2026-03-25 15:42:39,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:42:39,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:42:43,985][__main__][INFO] - Number of regex retries in iteration 89: 0 [2026-03-25 15:42:43,987][__main__][INFO] - agents played in iteration 89 are Alice, Bob [2026-03-25 15:42:44,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:42:44,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:42:44,636][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:42:44,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:42:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:42:45,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:42:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:42:47,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:42:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:42:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:42:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:49,881][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:51,856][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:42:52,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:53,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:55,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:57,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:42:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:59,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:01,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:03,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:05,677][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:43:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:43:07,652][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:43:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:43:08,968][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:43:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:43:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:43:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:43:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:43:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:43:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:43:13,575][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:43:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:43:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:43:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:43:16,210][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:43:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:43:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:43:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:43:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:43:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:43:20,390][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:43:21,048][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:43:21,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:24,998][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:25,657][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:27,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:43:29,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:43:30,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:30,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:30,177][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:31,574][__main__][INFO] - Iteration 90 took 52s (9.33% Gen, 88.00% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 12m 40s. Estimated total time: 14h 34m 47s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 23s. [2026-03-25 15:43:31,578][__main__][INFO] - Starting iteration 90. [2026-03-25 15:43:31,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:43:31,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:43:36,476][__main__][INFO] - Number of regex retries in iteration 90: 0 [2026-03-25 15:43:36,478][__main__][INFO] - agents played in iteration 90 are Alice, Bob [2026-03-25 15:43:37,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:43:37,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:43:37,148][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:43:37,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:43:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:43:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:43:39,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:43:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:43:40,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:43:41,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:43:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:43:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:43:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:43:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:43:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:43:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:43:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:43:46,373][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:43:47,033][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:43:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:43:48,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:43:49,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:50,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:52,970][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:54,289][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:55,610][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:56,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:58,248][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:43:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:59,566][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:02,866][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:03,525][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:04,185][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:07,484][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:44:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:44:08,804][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:44:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:44:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:44:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:44:11,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:44:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:44:13,019][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:44:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:44:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:44:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:44:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:44:16,317][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:44:16,977][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:44:17,636][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:44:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:44:18,957][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:44:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:44:20,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:44:21,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:44:22,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:44:22,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:44:22,152][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:44:23,537][__main__][INFO] - Iteration 91 took 51s (9.42% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 2m 56s. Estimated total time: 14h 25m 56s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 58s. [2026-03-25 15:44:23,539][__main__][INFO] - Starting iteration 91. [2026-03-25 15:44:23,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:44:23,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:28,701][__main__][INFO] - Number of regex retries in iteration 91: 0 [2026-03-25 15:44:28,703][__main__][INFO] - agents played in iteration 91 are Alice, Bob [2026-03-25 15:44:29,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:44:29,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:44:29,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:44:29,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:44:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:30,643][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:44:31,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:44:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:44:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:44:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:44:34,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:44:35,258][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:44:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:44:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:44:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:44:37,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:44:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:44:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:44:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:44:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:44:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:44:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:44:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:44:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:44:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:44:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:47,124][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:47,783][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:50,421][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:44:51,080][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:53,058][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:55,696][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:57,014][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:58,332][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:58,992][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:00,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:45:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:05,215][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:06,535][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:07,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:45:08,513][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:45:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:45:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:45:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:45:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:45:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:45:12,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:45:13,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:45:14,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:45:14,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:45:14,266][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:45:15,606][__main__][INFO] - Iteration 92 took 52s (9.91% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 3m 53s. Estimated total time: 14h 27m 44s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 52s. [2026-03-25 15:45:15,609][__main__][INFO] - Starting iteration 92. [2026-03-25 15:45:15,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:45:15,613][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:45:20,359][__main__][INFO] - Number of regex retries in iteration 92: 0 [2026-03-25 15:45:20,361][__main__][INFO] - agents played in iteration 92 are Alice, Bob [2026-03-25 15:45:20,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:45:21,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:45:21,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:45:21,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:45:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:45:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:45:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:23,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:24,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:25,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:26,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:26,894][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:45:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:29,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:30,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:33,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:45:34,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:45:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:45:36,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:45:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:45:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:45:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:45:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:45:39,408][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:45:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:45:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:45:42,041][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:45:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:45:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:45:44,015][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:45:44,675][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:45:45,333][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:45:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:45:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:45:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:48,628][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:49,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:49,945][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:50,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:45:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:57,447][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:58,106][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:58,765][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:00,745][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:03,384][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:04,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:46:04,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:46:06,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:06,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:06,056][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:07,513][__main__][INFO] - Iteration 93 took 51s (9.15% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 0m 18s. Estimated total time: 14h 25m 2s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 31s. [2026-03-25 15:46:07,515][__main__][INFO] - Starting iteration 93. [2026-03-25 15:46:07,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:46:07,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:46:12,403][__main__][INFO] - Number of regex retries in iteration 93: 0 [2026-03-25 15:46:12,404][__main__][INFO] - agents played in iteration 93 are Alice, Bob [2026-03-25 15:46:12,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:46:12,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:46:12,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:46:12,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:46:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:46:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:46:14,958][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:46:15,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:46:16,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:46:16,931][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:46:17,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:46:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:46:18,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:46:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:46:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:46:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:46:21,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:46:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:46:22,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:46:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:46:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:46:24,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:46:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:46:26,148][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:27,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:28,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:30,102][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:30,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:32,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:34,056][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:46:34,715][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:36,691][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:46:37,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:46:38,023][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:46:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:46:39,350][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:46:40,020][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:46:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:46:41,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:46:42,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:46:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:46:43,343][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:46:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:46:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:46:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:46:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:46:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:46:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:49,049][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:49,712][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:50,379][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:53,038][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:55,028][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:56,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:46:57,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:46:58,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:58,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:58,560][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:59,845][__main__][INFO] - Iteration 94 took 52s (9.33% Gen, 88.20% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 6m 32s. Estimated total time: 14h 32m 8s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 4s. [2026-03-25 15:46:59,848][__main__][INFO] - Starting iteration 94. [2026-03-25 15:46:59,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:46:59,853][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:05,027][__main__][INFO] - Number of regex retries in iteration 94: 0 [2026-03-25 15:47:05,028][__main__][INFO] - agents played in iteration 94 are Alice, Bob [2026-03-25 15:47:05,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:05,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:05,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:47:05,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:47:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:47:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:47:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:47:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:47:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:47:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:47:10,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:47:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:47:11,909][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:47:12,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:47:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:47:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:47:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:47:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:47:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:47:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:47:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:47:17,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:47:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:47:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:47:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:47:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:47:21,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:47:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:47:22,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:47:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:47:23,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:47:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:47:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:47:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:47:26,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:47:27,104][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:47:27,764][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:47:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:47:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:31,066][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:33,046][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:36,349][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:37,009][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:47:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:47:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:47:42,633][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:47:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:47:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:47:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:47:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:49,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:47:50,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:47:51,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:47:51,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:47:51,092][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:47:52,668][__main__][INFO] - Iteration 95 took 52s (9.80% Gen, 87.21% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 13m 49s. Estimated total time: 14h 40m 17s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 8s. [2026-03-25 15:47:52,670][__main__][INFO] - Starting iteration 95. [2026-03-25 15:47:52,674][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:47:52,675][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:57,962][__main__][INFO] - Number of regex retries in iteration 95: 0 [2026-03-25 15:47:57,963][__main__][INFO] - agents played in iteration 95 are Alice, Bob [2026-03-25 15:47:58,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:58,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:58,681][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:47:58,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:47:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:00,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:00,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:02,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:04,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:06,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:48:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:48:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:48:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:48:08,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:48:09,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:48:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:48:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:48:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:48:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:48:12,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:48:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:48:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:48:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:48:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:48:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:48:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:48:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:48:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:48:18,607][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:48:19,267][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:48:19,930][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:48:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:48:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:48:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:48:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:48:23,233][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:48:23,894][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:48:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:48:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:48:25,876][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:28,523][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:48:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:48:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:48:31,499][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:48:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:48:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:48:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:48:34,141][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:34,802][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:39,423][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:41,406][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:42,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:48:42,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:48:44,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:44,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:44,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:45,303][__main__][INFO] - Iteration 96 took 52s (10.05% Gen, 87.50% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 9m 49s. Estimated total time: 14h 37m 10s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 35s. [2026-03-25 15:48:45,305][__main__][INFO] - Starting iteration 96. [2026-03-25 15:48:45,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:48:45,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:48:50,517][__main__][INFO] - Number of regex retries in iteration 96: 0 [2026-03-25 15:48:50,518][__main__][INFO] - agents played in iteration 96 are Alice, Bob [2026-03-25 15:48:51,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:48:51,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:48:51,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:48:51,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:48:52,224][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:54,162][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:54,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:56,806][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:58,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:48:59,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:03,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:06,044][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:49:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:49:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:49:08,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:49:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:49:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:49:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:49:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:49:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:49:12,639][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:49:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:49:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:49:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:49:15,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:49:15,937][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:49:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:49:17,255][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:49:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:49:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:49:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:49:19,893][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:49:20,552][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:49:21,212][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:49:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:24,254][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:49:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:49:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:49:28,212][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:49:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:49:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:49:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:49:30,852][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:49:31,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:49:32,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:49:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:49:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:49:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:49:34,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:49:35,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:49:36,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:36,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:36,850][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:38,213][__main__][INFO] - Iteration 97 took 52s (9.84% Gen, 87.58% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 13m 31s. Estimated total time: 14h 41m 45s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 52s. [2026-03-25 15:49:38,216][__main__][INFO] - Starting iteration 97. [2026-03-25 15:49:38,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:49:38,221][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:43,519][__main__][INFO] - Number of regex retries in iteration 97: 0 [2026-03-25 15:49:43,520][__main__][INFO] - agents played in iteration 97 are Alice, Bob [2026-03-25 15:49:44,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:44,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:44,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:49:44,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:49:45,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:45,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:49:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:49:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:49:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:49:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:49:49,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:49:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:49:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:49:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:49:52,502][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:57,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:58,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:59,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:02,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:03,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:05,700][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:50:06,359][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:07,018][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:50:07,677][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:50:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:50:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:50:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:50:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:50:10,974][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:50:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:50:12,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:50:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:50:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:50:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:50:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:50:15,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:50:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:50:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:50:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:50:18,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:50:19,315][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:50:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:21,294][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:21,954][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:22,613][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:23,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:50:23,935][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:50:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:50:25,255][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:50:25,914][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:50:26,574][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:50:27,233][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:50:27,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:50:28,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:50:29,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:50:29,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:50:29,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:50:31,364][__main__][INFO] - Iteration 98 took 53s (9.97% Gen, 87.18% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 16m 39s. Estimated total time: 14h 45m 46s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 53s. [2026-03-25 15:50:31,367][__main__][INFO] - Starting iteration 98. [2026-03-25 15:50:31,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:50:31,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:50:36,590][__main__][INFO] - Number of regex retries in iteration 98: 0 [2026-03-25 15:50:36,591][__main__][INFO] - agents played in iteration 98 are Alice, Bob [2026-03-25 15:50:37,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:50:37,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:50:37,448][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:50:37,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:50:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:50:45,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:46,747][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:48,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:50:48,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:50:49,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:50:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:50:50,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:50:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:50:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:50:52,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:53,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:53,997][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:54,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:55,975][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:58,611][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:50:59,272][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:01,248][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:01,907][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:03,225][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:03,886][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:04,544][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:51:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:51:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:51:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:51:08,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:51:09,160][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:51:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:51:10,820][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:51:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:51:12,138][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:51:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:51:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:51:14,115][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:51:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:51:15,435][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:51:16,095][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:18,073][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:19,392][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:51:20,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:51:21,555][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:51:22,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:51:22,668][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:51:22,669][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:51:24,003][__main__][INFO] - Iteration 99 took 52s (9.92% Gen, 87.54% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 7m 14s. Estimated total time: 14h 37m 14s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 37s. [2026-03-25 15:51:24,005][__main__][INFO] - Starting iteration 99. [2026-03-25 15:51:24,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:51:24,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:51:30,100][__main__][INFO] - Number of regex retries in iteration 99: 0 [2026-03-25 15:51:30,101][__main__][INFO] - agents played in iteration 99 are Alice, Bob [2026-03-25 15:51:30,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:51:30,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:51:30,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:51:30,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:51:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:51:32,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:51:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:51:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:51:34,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:51:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:51:35,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:51:36,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:51:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:51:37,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:51:38,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:51:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:51:39,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:51:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:40,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:44,862][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:45,523][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:48,165][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:48,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:51:51,465][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:51:52,125][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:51:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:51:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:54,765][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:58,064][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:01,364][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:03,684][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:05,661][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:52:06,320][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:52:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:52:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:52:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:52:08,956][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:52:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:52:10,275][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:52:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:52:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:52:12,254][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:52:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:52:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:14,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:52:15,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:52:16,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:16,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:16,212][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:17,536][__main__][INFO] - Iteration 100 took 53s (11.38% Gen, 86.14% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 21m 15s. Estimated total time: 14h 52m 8s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 12s, 500 more iterations: 7h 26m 4s. [2026-03-25 15:52:17,538][__main__][INFO] - Starting iteration 100. [2026-03-25 15:52:17,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:52:17,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:52:23,849][__main__][INFO] - Number of regex retries in iteration 100: 0 [2026-03-25 15:52:23,851][__main__][INFO] - agents played in iteration 100 are Alice, Bob [2026-03-25 15:52:24,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:52:24,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:52:24,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:52:24,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:52:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:52:25,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:52:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:52:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:52:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:52:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:52:29,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:52:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:52:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:52:31,091][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:52:31,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:52:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:52:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:52:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:52:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:52:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:52:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:52:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:52:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:52:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:52:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:52:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:52:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:52:40,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:52:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:52:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:44,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:45,609][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:52:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:52,209][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:53,527][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:54,849][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:57,200][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:57,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:58,518][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:59,177][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:52:59,837][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:00,498][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:01,158][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:03,136][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:03,795][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:04,458][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:05,117][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:05,777][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:53:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:53:07,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:53:08,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:53:09,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:53:09,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:53:09,775][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:53:12,345][__main__][INFO] - Iteration 101 took 54s (11.51% Gen, 83.79% Train). Generation: 6s, Training: 45s. Estimated remaining time: 13h 41m 36s. Estimated total time: 15h 13m 24s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 42s. [2026-03-25 15:53:12,347][__main__][INFO] - Starting iteration 101. [2026-03-25 15:53:12,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:53:12,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:53:17,246][__main__][INFO] - Number of regex retries in iteration 101: 0 [2026-03-25 15:53:17,247][__main__][INFO] - agents played in iteration 101 are Alice, Bob [2026-03-25 15:53:17,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:53:18,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:53:18,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:53:18,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:53:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:53:19,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:53:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:53:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:53:21,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:53:22,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:53:22,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:53:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:53:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:53:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:53:25,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:53:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:53:26,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:53:27,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:53:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:53:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:53:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:53:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:53:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:53:31,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:53:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:53:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:53:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:53:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:53:34,626][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:53:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:53:35,949][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:53:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:53:37,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:53:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:53:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:53:39,253][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:53:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:53:40,574][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:53:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:41,898][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:43,224][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:53:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:53:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:53:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:53:45,869][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:53:46,529][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:53:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:53:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:53:48,511][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:53:49,171][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:53:53,468][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:54,128][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:58,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:00,732][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:01,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:54:02,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:54:03,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:03,307][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:03,309][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:04,764][__main__][INFO] - Iteration 102 took 52s (9.34% Gen, 87.88% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 0m 53s. Estimated total time: 14h 33m 34s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 47s. [2026-03-25 15:54:04,767][__main__][INFO] - Starting iteration 102. [2026-03-25 15:54:04,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:54:04,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:54:09,684][__main__][INFO] - Number of regex retries in iteration 102: 0 [2026-03-25 15:54:09,684][__main__][INFO] - agents played in iteration 102 are Alice, Bob [2026-03-25 15:54:10,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:54:10,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:54:10,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:54:10,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:54:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:54:11,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:54:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:54:13,059][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:54:13,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:54:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:54:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:54:15,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:54:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:54:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:54:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:54:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:54:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:54:19,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:54:20,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:54:20,988][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:54:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:54:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:54:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:54:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:54:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:54:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:54:25,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:54:26,288][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:54:26,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:54:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:54:28,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:54:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:54:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:54:30,252][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:54:30,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:54:31,574][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:54:32,234][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:54:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:54:33,555][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:54:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:54:34,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:54:35,536][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:36,196][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:54:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:54:40,153][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:54:40,813][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:54:41,472][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:54:42,132][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:54:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:54:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:54:44,448][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:54:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:54:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:54:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:54:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:54:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:54:48,405][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:54:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:54:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:54:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:54:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:54:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:53,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:54:54,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:54:55,607][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:55,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:55,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:56,909][__main__][INFO] - Iteration 103 took 52s (9.42% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 55m 27s. Estimated total time: 14h 29m 0s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 30s. [2026-03-25 15:54:56,912][__main__][INFO] - Starting iteration 103. [2026-03-25 15:54:56,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:54:56,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:01,717][__main__][INFO] - Number of regex retries in iteration 103: 0 [2026-03-25 15:55:01,718][__main__][INFO] - agents played in iteration 103 are Alice, Bob [2026-03-25 15:55:02,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:02,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:02,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:55:02,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:55:03,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:04,365][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:55:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:55:10,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:55:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:55:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:55:12,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:55:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:55:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:55:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:55:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:55:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:55:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:55:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:55:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:55:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:55:19,545][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:55:20,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:55:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:55:21,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:55:22,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:55:22,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:55:23,504][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:55:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:55:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:55:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:55:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:55:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:55:27,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:55:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:55:28,780][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:55:29,440][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:55:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:55:30,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:32,742][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:35,716][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:55:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:55:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:55:37,696][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:55:38,357][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:55:39,016][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:55:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:55:40,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:55:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:55:41,659][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:55:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:55:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:55:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:55:44,302][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:55:44,963][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:55:45,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:55:46,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:55:47,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:55:47,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:55:47,579][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:55:48,826][__main__][INFO] - Iteration 104 took 51s (9.25% Gen, 88.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 50m 47s. Estimated total time: 14h 25m 12s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 36s. [2026-03-25 15:55:48,828][__main__][INFO] - Starting iteration 104. [2026-03-25 15:55:48,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:55:48,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:53,855][__main__][INFO] - Number of regex retries in iteration 104: 0 [2026-03-25 15:55:53,856][__main__][INFO] - agents played in iteration 104 are Alice, Bob [2026-03-25 15:55:54,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:54,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:54,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:55:54,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:55:55,316][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:57,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:00,570][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:56:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:04,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:05,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:06,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:07,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:09,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:56:13,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:56:14,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:56:15,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:56:15,756][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:56:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:56:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:56:17,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:56:18,395][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:56:19,055][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:56:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:56:20,374][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:56:21,034][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:56:21,695][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:56:22,354][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:56:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:56:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:56:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:56:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:56:25,651][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:56:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:56:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:56:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:56:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:32,580][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:33,899][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:56:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:56:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:56:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:56:37,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:56:38,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:56:39,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:56:39,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:56:39,667][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:56:41,088][__main__][INFO] - Iteration 105 took 52s (9.61% Gen, 87.66% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 55m 40s. Estimated total time: 14h 30m 57s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 28s. [2026-03-25 15:56:41,091][__main__][INFO] - Starting iteration 105. [2026-03-25 15:56:41,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:56:41,094][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:56:46,098][__main__][INFO] - Number of regex retries in iteration 105: 0 [2026-03-25 15:56:46,100][__main__][INFO] - agents played in iteration 105 are Alice, Bob [2026-03-25 15:56:46,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:56:46,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:56:46,748][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:56:46,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:56:47,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:56:48,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:56:48,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:56:49,475][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:56:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:50,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:56:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:58,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:59,364][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:00,023][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:00,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:01,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:04,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:05,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:07,929][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:57:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:11,885][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:13,203][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:13,862][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:16,498][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:57:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:57:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:57:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:57:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:57:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:57:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:57:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:57:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:57:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:57:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:57:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:57:24,746][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:57:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:57:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:57:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:57:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:30,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:57:30,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:57:31,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:57:31,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:57:31,869][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:57:33,141][__main__][INFO] - Iteration 106 took 52s (9.62% Gen, 87.94% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 51m 19s. Estimated total time: 14h 27m 28s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 44s. [2026-03-25 15:57:33,143][__main__][INFO] - Starting iteration 106. [2026-03-25 15:57:33,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:57:33,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:57:38,073][__main__][INFO] - Number of regex retries in iteration 106: 0 [2026-03-25 15:57:38,074][__main__][INFO] - agents played in iteration 106 are Alice, Bob [2026-03-25 15:57:38,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:57:38,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:57:38,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:57:38,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:57:39,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:57:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:57:40,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:57:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:57:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:57:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:57:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:57:44,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:57:44,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:57:45,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:46,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:57:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:48,082][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:48,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:49,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:57:50,058][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:57:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:57:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:52,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:54,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:54,669][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:55,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:56,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:59,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:59,943][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:58:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:02,580][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:03,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:05,216][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:07,195][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:07,855][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:58:08,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:12,825][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:13,485][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:58:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:14,807][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:58:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:58:18,766][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:58:19,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:58:20,085][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:58:20,745][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:58:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:58:22,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:58:22,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:58:23,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:23,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:23,983][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:25,190][__main__][INFO] - Iteration 107 took 52s (9.47% Gen, 88.21% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 50m 23s. Estimated total time: 14h 27m 25s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 42s. [2026-03-25 15:58:25,192][__main__][INFO] - Starting iteration 107. [2026-03-25 15:58:25,196][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:58:25,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:58:30,034][__main__][INFO] - Number of regex retries in iteration 107: 0 [2026-03-25 15:58:30,035][__main__][INFO] - agents played in iteration 107 are Alice, Bob [2026-03-25 15:58:30,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:58:30,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:58:30,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:58:30,645][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:58:31,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:58:32,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:58:32,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:58:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:58:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:58:34,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:58:35,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:58:36,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:58:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:58:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:58:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:58:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:58:39,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:58:40,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:58:40,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:58:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:41,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:58:45,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:58:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:58:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:58:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:58:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:58:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:58:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:58:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:58:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:58:51,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:58:51,882][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:58:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:58,498][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:59,158][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:03,458][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:04,778][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:59:06,098][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:07,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:59:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:59:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:59:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:10,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:13,368][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:14,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:59:14,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:59:15,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:15,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:15,959][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:17,228][__main__][INFO] - Iteration 108 took 52s (9.30% Gen, 88.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 49m 20s. Estimated total time: 14h 27m 13s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 36s. [2026-03-25 15:59:17,231][__main__][INFO] - Starting iteration 108. [2026-03-25 15:59:17,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:59:17,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:22,283][__main__][INFO] - Number of regex retries in iteration 108: 0 [2026-03-25 15:59:22,284][__main__][INFO] - agents played in iteration 108 are Alice, Bob [2026-03-25 15:59:22,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:22,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:22,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:59:22,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:59:23,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:25,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:26,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:59:26,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:59:27,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:59:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:59:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:59:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:59:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:59:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:59:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:59:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:59:32,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:59:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:59:34,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:59:34,925][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:59:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:59:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:59:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:39,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:40,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:59:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:59:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:59:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:59:43,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:59:44,171][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:59:44,830][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:59:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:59:46,149][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:59:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:59:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:59:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:59:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:59:49,452][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:59:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:59:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:59:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:59:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:52,755][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:54,735][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:56,403][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:57,724][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:59:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:03,009][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:03,669][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:05,651][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:06,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:00:07,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:00:08,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:00:08,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:00:08,377][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:00:09,580][__main__][INFO] - Iteration 109 took 52s (9.65% Gen, 88.05% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 53m 41s. Estimated total time: 14h 32m 27s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 13s. [2026-03-25 16:00:09,582][__main__][INFO] - Starting iteration 109. [2026-03-25 16:00:09,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:00:09,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:14,382][__main__][INFO] - Number of regex retries in iteration 109: 0 [2026-03-25 16:00:14,383][__main__][INFO] - agents played in iteration 109 are Alice, Bob [2026-03-25 16:00:14,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:00:14,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:00:14,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:00:14,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:00:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:16,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:17,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:18,328][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:21,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:00:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:00:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:00:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:00:26,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:00:27,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:00:28,243][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:00:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:00:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:00:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:00:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:00:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:00:32,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:00:32,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:33,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:36,190][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:00:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:00:37,511][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:00:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:00:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:00:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:00:40,155][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:00:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:00:41,476][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:00:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:00:42,796][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:00:43,457][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:00:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:00:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:00:45,438][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:00:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:00:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:00:47,790][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:00:48,452][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:00:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:00:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:00:50,433][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:00:51,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:00:51,752][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:52,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:53,073][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:54,396][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:56,382][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:58,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:00:59,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:01:00,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:00,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:00,198][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:01,666][__main__][INFO] - Iteration 110 took 52s (9.21% Gen, 87.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 48m 24s. Estimated total time: 14h 28m 2s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 1s. [2026-03-25 16:01:01,669][__main__][INFO] - Starting iteration 110. [2026-03-25 16:01:01,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:01:01,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:01:06,557][__main__][INFO] - Number of regex retries in iteration 110: 0 [2026-03-25 16:01:06,559][__main__][INFO] - agents played in iteration 110 are Alice, Bob [2026-03-25 16:01:07,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:01:07,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:01:07,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:01:07,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:01:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:01:08,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:01:09,188][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:01:09,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:01:10,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:01:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:13,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:14,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:01:15,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:19,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:21,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:23,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:23,716][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:01:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:01:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:01:25,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:01:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:01:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:01:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:01:28,335][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:01:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:29,659][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:30,323][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:32,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:01:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:01:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:01:34,943][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:01:35,603][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:01:36,263][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:01:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:01:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:01:38,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:01:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:01:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:01:40,559][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:01:41,219][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:01:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:01:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:01:43,201][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:01:43,860][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:01:44,520][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:01:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:01:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:01:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:01:47,159][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:01:47,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:01:48,480][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:01:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:01:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:01:50,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:01:51,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:01:52,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:52,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:52,378][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:53,748][__main__][INFO] - Iteration 111 took 52s (9.38% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 47m 27s. Estimated total time: 14h 27m 57s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 58s. [2026-03-25 16:01:53,750][__main__][INFO] - Starting iteration 111. [2026-03-25 16:01:53,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:01:53,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:01,121][__main__][INFO] - Number of regex retries in iteration 111: 0 [2026-03-25 16:02:01,122][__main__][INFO] - agents played in iteration 111 are Alice, Bob [2026-03-25 16:02:01,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:01,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:01,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:02:01,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:02:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:05,019][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:05,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:06,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:06,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:07,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:02:08,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:02:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:02:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:02:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:02:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:02:11,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:02:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:02:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:02:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:14,909][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:16,886][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:19,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:21,499][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:22,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:22,816][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:02:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:24,134][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:02:24,793][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:02:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:02:26,109][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:02:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:02:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:29,404][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:02:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:02:30,723][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:02:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:02:32,041][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:02:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:02:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:02:34,353][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:02:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:02:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:02:36,329][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:02:36,989][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:02:37,648][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:02:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:02:38,967][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:02:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:02:40,285][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:02:40,946][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:02:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:02:42,263][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:02:42,922][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:02:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:02:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:02:44,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:02:45,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:02:46,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:02:46,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:02:46,895][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:02:48,098][__main__][INFO] - Iteration 112 took 54s (13.56% Gen, 84.22% Train). Generation: 7s, Training: 45s. Estimated remaining time: 13h 24m 22s. Estimated total time: 15h 5m 46s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 34s, 500 more iterations: 7h 32m 53s. [2026-03-25 16:02:48,100][__main__][INFO] - Starting iteration 112. [2026-03-25 16:02:48,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:02:48,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:52,902][__main__][INFO] - Number of regex retries in iteration 112: 0 [2026-03-25 16:02:52,903][__main__][INFO] - agents played in iteration 112 are Alice, Bob [2026-03-25 16:02:53,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:53,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:53,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:02:53,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:02:54,253][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:56,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:57,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:58,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:58,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:00,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:03:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:02,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:03,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:04,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:04,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:06,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:03:08,717][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:03:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:03:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:03:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:03:11,353][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:03:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:03:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:03:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:03:13,988][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:03:14,647][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:03:15,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:03:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:16,624][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:18,601][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:19,260][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:20,576][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:21,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:23,213][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:24,531][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:25,190][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:03:27,504][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:03:28,166][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:03:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:03:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:03:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:03:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:03:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:03:32,135][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:03:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:03:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:03:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:03:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:03:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:03:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:03:36,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:03:37,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:03:38,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:03:38,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:03:38,628][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:03:39,859][__main__][INFO] - Iteration 113 took 51s (9.27% Gen, 88.35% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 40m 21s. Estimated total time: 14h 22m 37s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 15s, 500 more iterations: 7h 11m 18s. [2026-03-25 16:03:39,861][__main__][INFO] - Starting iteration 113. [2026-03-25 16:03:39,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:03:39,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:03:44,592][__main__][INFO] - Number of regex retries in iteration 113: 0 [2026-03-25 16:03:44,593][__main__][INFO] - agents played in iteration 113 are Alice, Bob [2026-03-25 16:03:45,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:03:45,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:03:45,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:03:45,252][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:03:46,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:03:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:03:47,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:03:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:03:48,600][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:03:49,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:03:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:03:50,580][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:03:51,242][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:03:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:55,205][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:57,851][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:00,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:01,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:03,127][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:03,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:05,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:06,424][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:04:07,084][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:07,744][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:04:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:04:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:04:09,723][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:04:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:13,681][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:14,341][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:15,001][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:15,661][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:16,321][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:18,634][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:19,294][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:19,954][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:04:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:23,251][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:23,911][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:25,892][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:27,873][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:28,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:04:29,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:04:30,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:04:30,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:04:30,332][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:04:31,770][__main__][INFO] - Iteration 114 took 51s (9.11% Gen, 88.12% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 41m 59s. Estimated total time: 14h 25m 6s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 33s. [2026-03-25 16:04:31,772][__main__][INFO] - Starting iteration 114. [2026-03-25 16:04:31,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:04:31,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:04:36,630][__main__][INFO] - Number of regex retries in iteration 114: 0 [2026-03-25 16:04:36,631][__main__][INFO] - agents played in iteration 114 are Alice, Bob [2026-03-25 16:04:37,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:04:37,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:04:37,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:04:37,280][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:04:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:04:38,659][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:04:39,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:04:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:04:40,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:04:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:04:41,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:04:42,620][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:04:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:04:43,942][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:04:44,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:04:45,265][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:04:45,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:04:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:04:47,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:04:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:04:48,566][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:04:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:04:49,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:04:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:04:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:04:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:53,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:53,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:56,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:58,476][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:04:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:59,794][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:04,412][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:07,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:05:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:05:08,370][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:05:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:05:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:05:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:05:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:05:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:05:19,263][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:05:19,923][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:05:20,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:05:21,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:05:22,500][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:22,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:22,505][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:24,264][__main__][INFO] - Iteration 115 took 52s (9.25% Gen, 87.39% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 50m 50s. Estimated total time: 14h 34m 50s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 25s. [2026-03-25 16:05:24,266][__main__][INFO] - Starting iteration 115. [2026-03-25 16:05:24,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:05:24,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:29,325][__main__][INFO] - Number of regex retries in iteration 115: 0 [2026-03-25 16:05:29,327][__main__][INFO] - agents played in iteration 115 are Alice, Bob [2026-03-25 16:05:29,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:05:29,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:05:29,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:05:29,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:05:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:05:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:05:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:05:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:05:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:05:33,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:05:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:05:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:05:35,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:05:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:05:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:05:37,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:05:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:05:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:05:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:05:40,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:05:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:05:41,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:05:42,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:05:43,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:05:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:05:44,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:05:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:05:45,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:05:46,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:05:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:05:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:05:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:05:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:05:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:05:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:05:51,109][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:05:51,769][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:05:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:53,089][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:02,826][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:03,487][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:06:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:06:07,457][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:06:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:08,779][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:10,100][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:13,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:06:14,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:06:15,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:15,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:15,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:06:16,851][__main__][INFO] - Iteration 116 took 52s (9.62% Gen, 87.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 51m 30s. Estimated total time: 14h 36m 23s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 38s, 500 more iterations: 7h 18m 11s. [2026-03-25 16:06:16,854][__main__][INFO] - Starting iteration 116. [2026-03-25 16:06:16,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:06:16,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:06:22,244][__main__][INFO] - Number of regex retries in iteration 116: 0 [2026-03-25 16:06:22,245][__main__][INFO] - agents played in iteration 116 are Alice, Bob [2026-03-25 16:06:22,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:06:22,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:06:22,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:06:22,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:06:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:24,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:26,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:27,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:29,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:06:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:06:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:06:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:06:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:06:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:06:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:06:36,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:06:36,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:06:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:06:38,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:06:38,948][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:06:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:06:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:06:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:06:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:06:42,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:06:42,904][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:06:43,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:06:44,222][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:06:44,882][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:06:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:06:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:06:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:06:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:06:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:06:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:06:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:06:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:06:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:06:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:06:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:06:52,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:57,746][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:06:58,404][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:59,723][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:03,018][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:05,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:06,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:07:07,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:07:08,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:07:08,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:07:08,302][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:07:09,576][__main__][INFO] - Iteration 117 took 52s (10.22% Gen, 87.36% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 52m 53s. Estimated total time: 14h 38m 39s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 19s. [2026-03-25 16:07:09,578][__main__][INFO] - Starting iteration 117. [2026-03-25 16:07:09,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:07:09,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:07:14,391][__main__][INFO] - Number of regex retries in iteration 117: 0 [2026-03-25 16:07:14,392][__main__][INFO] - agents played in iteration 117 are Alice, Bob [2026-03-25 16:07:14,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:07:15,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:07:15,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:07:15,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:07:15,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:07:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:07:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:07:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:07:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:07:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:07:19,818][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:07:20,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:07:21,136][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:07:21,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:07:22,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:07:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:07:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:07:24,429][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:25,088][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:26,406][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:28,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:29,700][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:30,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:31,677][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:07:35,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:07:36,287][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:07:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:07:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:07:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:07:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:07:39,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:07:40,243][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:07:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:07:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:07:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:07:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:07:43,536][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:07:44,195][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:07:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:07:45,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:07:46,173][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:07:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:07:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:07:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:07:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:07:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:07:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:07:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:07:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:56,400][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:57,059][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:58,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:07:59,172][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:08:00,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:08:00,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:08:00,406][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:08:04,745][__main__][INFO] - Iteration 118 took 55s (8.72% Gen, 83.41% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 32m 43s. Estimated total time: 15h 19m 24s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 42s. [2026-03-25 16:08:04,748][__main__][INFO] - Starting iteration 118. [2026-03-25 16:08:04,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:08:04,751][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:08:09,727][__main__][INFO] - Number of regex retries in iteration 118: 0 [2026-03-25 16:08:09,728][__main__][INFO] - agents played in iteration 118 are Alice, Bob [2026-03-25 16:08:10,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:08:10,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:08:10,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:08:10,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:08:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:08:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:08:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:08:13,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:08:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:08:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:08:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:08:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:08:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:08:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:08:17,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:08:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:08:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:08:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:08:20,410][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:08:21,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:08:21,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:08:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:08:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:08:23,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:08:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:08:25,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:08:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:08:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:08:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:08:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:08:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:08:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:08:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:08:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:30,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:31,612][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:08:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:33,586][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:35,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:36,881][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:37,540][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:38,857][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:40,832][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:41,490][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:08:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:08:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:08:43,799][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:08:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:08:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:08:45,775][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:08:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:08:47,092][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:08:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:08:48,411][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:08:49,070][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:08:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:08:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:08:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:08:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:08:52,363][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:08:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:08:53,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:08:54,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:08:55,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:08:55,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:08:55,601][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:08:56,986][__main__][INFO] - Iteration 119 took 52s (9.53% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 43m 3s. Estimated total time: 14h 30m 36s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 18s. [2026-03-25 16:08:56,988][__main__][INFO] - Starting iteration 119. [2026-03-25 16:08:56,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:08:56,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:09:01,927][__main__][INFO] - Number of regex retries in iteration 119: 0 [2026-03-25 16:09:01,929][__main__][INFO] - agents played in iteration 119 are Alice, Bob [2026-03-25 16:09:02,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:09:02,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:09:02,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:09:02,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:09:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:09:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:09:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:09:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:09:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:09:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:09:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:09:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:09:08,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:09:09,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:09:09,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:09:10,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:09:11,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:09:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:09:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:09:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:09:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:09:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:09:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:09:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:09:16,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:09:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:09:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:09:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:09:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:09:19,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:09:20,552][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:09:21,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:09:21,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:09:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:09:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:09:23,847][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:09:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:09:25,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:09:25,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:09:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:09:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:09:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:09:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:09:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:09:29,785][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:09:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:09:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:09:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:09:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:09:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:09:33,740][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:35,417][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:36,075][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:36,733][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:09:38,051][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:38,710][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:39,368][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:40,027][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:40,686][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:42,004][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:42,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:43,979][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:09:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:09:45,298][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:09:45,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:09:46,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:09:47,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:09:47,979][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:09:47,980][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:09:49,371][__main__][INFO] - Iteration 120 took 52s (9.42% Gen, 87.92% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 44m 34s. Estimated total time: 14h 33m 0s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 30s. [2026-03-25 16:09:49,373][__main__][INFO] - Starting iteration 120. [2026-03-25 16:09:49,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:09:49,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:00,399][__main__][INFO] - Number of regex retries in iteration 120: 0 [2026-03-25 16:10:00,400][__main__][INFO] - agents played in iteration 120 are Alice, Bob [2026-03-25 16:10:00,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:00,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:00,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:00,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:03,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:06,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:10:08,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:10:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:10:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:10:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:10:10,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:10:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:10:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:10:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:10:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:10:14,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:10:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:10:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:10:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:10:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:10:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:10:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:10:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:10:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:10:20,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:10:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:10:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:10:22,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:10:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:10:23,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:10:24,107][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:10:24,766][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:10:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:10:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:10:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:10:27,403][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:10:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:10:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:10:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:10:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:10:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:10:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:10:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:10:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:10:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:10:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:10:34,988][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:10:35,648][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:10:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:10:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:10:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:10:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:10:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:10:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:10:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:10:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:10:41,577][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:10:42,237][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:43,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:44,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:10:45,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:10:46,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:46,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:46,126][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:47,460][__main__][INFO] - Iteration 121 took 58s (18.98% Gen, 78.72% Train). Generation: 11s, Training: 45s. Estimated remaining time: 14h 18m 41s. Estimated total time: 16h 8m 4s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 48s, 500 more iterations: 8h 4m 2s. [2026-03-25 16:10:47,462][__main__][INFO] - Starting iteration 121. [2026-03-25 16:10:47,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:10:47,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:52,302][__main__][INFO] - Number of regex retries in iteration 121: 0 [2026-03-25 16:10:52,303][__main__][INFO] - agents played in iteration 121 are Alice, Bob [2026-03-25 16:10:52,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:52,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:52,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:52,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:53,725][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:58,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:11:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:06,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:11:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:11:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:11:08,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:11:09,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:11:10,182][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:11:10,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:11:11,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:11:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:11:12,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:11:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:11:14,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:11:14,797][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:11:15,456][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:11:16,117][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:11:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:11:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:11:18,098][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:11:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:11:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:11:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:11:20,734][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:11:21,394][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:11:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:11:22,712][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:11:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:11:24,031][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:11:24,690][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:11:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:11:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:11:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:11:27,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:11:28,329][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:11:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:11:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:11:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:11:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:11:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:11:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:11:32,944][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:11:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:11:34,262][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:11:34,922][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:11:35,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:11:36,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:11:37,037][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:11:38,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:11:38,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:11:38,154][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:11:39,615][__main__][INFO] - Iteration 122 took 52s (9.27% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 38m 54s. Estimated total time: 14h 29m 10s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 35s. [2026-03-25 16:11:39,618][__main__][INFO] - Starting iteration 122. [2026-03-25 16:11:39,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:11:39,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:45,330][__main__][INFO] - Number of regex retries in iteration 122: 0 [2026-03-25 16:11:45,330][__main__][INFO] - agents played in iteration 122 are Alice, Bob [2026-03-25 16:11:45,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:11:45,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:11:45,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:11:45,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:11:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:48,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:50,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:11:51,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:52,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:11:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:55,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:56,497][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:57,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:59,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:12:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:12:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:12:05,077][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:12:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:12:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:12:07,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:12:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:12:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:12:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:12:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:12:10,355][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:12:11,013][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:12:11,671][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:12:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:12:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:12:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:12:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:12:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:12:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:12:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:12:16,957][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:12:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:12:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:12:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:12:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:12:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:12:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:12:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:12:22,564][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:12:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:12:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:12:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:12:25,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:12:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:12:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:12:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:12:27,847][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:12:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:12:29,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:12:29,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:12:31,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:12:31,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:12:31,141][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:12:32,415][__main__][INFO] - Iteration 123 took 52s (10.81% Gen, 86.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 48m 46s. Estimated total time: 14h 39m 54s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 57s. [2026-03-25 16:12:32,417][__main__][INFO] - Starting iteration 123. [2026-03-25 16:12:32,421][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:12:32,421][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:12:43,758][__main__][INFO] - Number of regex retries in iteration 123: 0 [2026-03-25 16:12:43,760][__main__][INFO] - agents played in iteration 123 are Alice, Bob [2026-03-25 16:12:44,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:12:44,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:12:44,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:12:44,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:12:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:12:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:12:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:12:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:12:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:48,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:49,041][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:50,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:51,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:12:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:53,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:54,968][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:02,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:03,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:04,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:05,522][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:13:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:07,500][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:10,136][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:10,795][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:11,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:12,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:13:12,772][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:13:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:13:14,092][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:13:14,751][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:13:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:13:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:13:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:13:17,740][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:13:18,400][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:13:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:13:19,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:13:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:13:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:13:21,696][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:13:22,353][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:13:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:13:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:13:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:13:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:13:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:13:26,311][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:13:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:13:27,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:13:28,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:13:29,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:13:29,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:13:29,591][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:13:30,973][__main__][INFO] - Iteration 124 took 58s (19.36% Gen, 78.27% Train). Generation: 11s, Training: 45s. Estimated remaining time: 14h 23m 47s. Estimated total time: 16h 15m 54s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 35s, 500 more iterations: 8h 7m 57s. [2026-03-25 16:13:30,975][__main__][INFO] - Starting iteration 124. [2026-03-25 16:13:30,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:13:30,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:13:35,898][__main__][INFO] - Number of regex retries in iteration 124: 0 [2026-03-25 16:13:35,899][__main__][INFO] - agents played in iteration 124 are Alice, Bob [2026-03-25 16:13:36,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:13:36,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:13:36,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:13:36,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:13:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:13:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:13:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:13:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:13:39,887][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:13:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:13:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:13:41,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:13:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:13:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:13:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:13:44,497][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:46,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:47,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:49,770][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:50,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:53,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:55,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:55,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:56,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:57,679][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:13:58,338][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:00,974][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:01,633][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:02,950][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:04,926][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:05,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:08,219][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:10,527][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:14:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:13,163][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:13,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:14:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:14:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:14:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:14:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:14:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:14:18,435][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:14:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:14:19,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:14:20,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:14:21,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:14:21,565][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:14:21,566][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:14:22,920][__main__][INFO] - Iteration 125 took 51s (9.47% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 32m 44s. Estimated total time: 14h 25m 43s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 51s. [2026-03-25 16:14:22,923][__main__][INFO] - Starting iteration 125. [2026-03-25 16:14:22,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:14:22,927][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:14:27,872][__main__][INFO] - Number of regex retries in iteration 125: 0 [2026-03-25 16:14:27,873][__main__][INFO] - agents played in iteration 125 are Alice, Bob [2026-03-25 16:14:28,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:14:28,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:14:28,530][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:14:28,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:14:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:14:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:14:30,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:14:31,309][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:14:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:14:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:14:33,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:14:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:14:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:14:35,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:14:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:14:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:14:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:14:37,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:14:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:14:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:14:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:14:40,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:42,502][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:43,161][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:43,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:46,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:47,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:49,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:49,746][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:14:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:53,698][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:54,356][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:56,991][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:58,966][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:00,284][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:01,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:02,598][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:03,257][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:15:03,916][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:04,575][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:05,893][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:06,551][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:08,526][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:09,185][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:09,844][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:11,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:15:12,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:15:13,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:13,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:13,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:15,338][__main__][INFO] - Iteration 126 took 52s (9.44% Gen, 87.59% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 39m 41s. Estimated total time: 14h 33m 32s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 46s. [2026-03-25 16:15:15,340][__main__][INFO] - Starting iteration 126. [2026-03-25 16:15:15,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:15:15,345][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:15:21,181][__main__][INFO] - Number of regex retries in iteration 126: 0 [2026-03-25 16:15:21,183][__main__][INFO] - agents played in iteration 126 are Alice, Bob [2026-03-25 16:15:21,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:15:21,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:15:21,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:15:21,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:15:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:15:23,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:15:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:15:24,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:15:25,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:15:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:15:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:15:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:15:27,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:15:28,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:15:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:15:29,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:15:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:15:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:15:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:15:32,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:15:33,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:15:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:15:34,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:15:35,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:15:35,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:15:36,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:15:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:15:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:39,621][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:40,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:15:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:15:42,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:15:42,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:15:43,574][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:15:44,232][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:15:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:45,549][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:47,524][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:49,501][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:15:50,159][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:15:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:15:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:15:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:15:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:54,440][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:55,099][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:15:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:59,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:00,374][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:01,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:04,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:16:05,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:16:06,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:06,985][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:06,986][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:16:08,247][__main__][INFO] - Iteration 127 took 52s (11.04% Gen, 86.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 47m 0s. Estimated total time: 14h 41m 45s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 52s. [2026-03-25 16:16:08,249][__main__][INFO] - Starting iteration 127. [2026-03-25 16:16:08,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:16:08,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:13,132][__main__][INFO] - Number of regex retries in iteration 127: 0 [2026-03-25 16:16:13,133][__main__][INFO] - agents played in iteration 127 are Alice, Bob [2026-03-25 16:16:13,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:16:13,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:16:13,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:16:13,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:16:14,484][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:15,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:16,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:16:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:16:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:16:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:16:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:16:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:16:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:16:21,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:16:21,706][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:16:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:16:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:16:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:16:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:16:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:16:25,657][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:16:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:16:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:16:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:16:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:16:28,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:16:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:16:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:16:30,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:16:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:16:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:16:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:34,882][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:16:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:36,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:16:36,857][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:16:37,517][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:16:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:16:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:16:39,493][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:16:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:16:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:16:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:44,762][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:47,731][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:48,390][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:16:49,049][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:16:49,709][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:16:50,368][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:16:51,027][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:16:51,687][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:55,639][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:56,300][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:56,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:16:57,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:16:58,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:58,910][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:58,911][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:17:00,143][__main__][INFO] - Iteration 128 took 51s (9.40% Gen, 88.22% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 29m 15s. Estimated total time: 14h 24m 52s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 26s. [2026-03-25 16:17:00,145][__main__][INFO] - Starting iteration 128. [2026-03-25 16:17:00,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:17:00,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:17:15,205][__main__][INFO] - Number of regex retries in iteration 128: 0 [2026-03-25 16:17:15,206][__main__][INFO] - agents played in iteration 128 are Alice, Bob [2026-03-25 16:17:15,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:17:15,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:17:15,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:17:15,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:17:16,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:17:17,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:17:17,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:17:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:22,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:17:23,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:25,078][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:27,064][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:27,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:17:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:17:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:17:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:17:31,700][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:17:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:17:33,023][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:17:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:17:34,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:17:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:17:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:17:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:17:36,972][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:17:37,630][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:17:38,289][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:40,922][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:41,580][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:17:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:17:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:17:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:17:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:17:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:17:45,536][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:17:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:17:46,861][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:17:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:17:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:17:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:17:49,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:17:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:17:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:52,507][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:17:54,488][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:17:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:17:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:17:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:17:57,134][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:17:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:17:58,453][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:17:59,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:18:00,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:18:01,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:01,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:01,047][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:04,426][__main__][INFO] - Iteration 129 took 1m 4s (23.42% Gen, 71.32% Train). Generation: 15s, Training: 45s. Estimated remaining time: 15h 54m 37s. Estimated total time: 17h 51m 18s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 7s, 500 more iterations: 8h 55m 39s. [2026-03-25 16:18:04,428][__main__][INFO] - Starting iteration 129. [2026-03-25 16:18:04,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:18:04,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:18:09,830][__main__][INFO] - Number of regex retries in iteration 129: 0 [2026-03-25 16:18:09,833][__main__][INFO] - agents played in iteration 129 are Alice, Bob [2026-03-25 16:18:10,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:18:10,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:18:10,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:18:10,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:18:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:18:11,942][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:18:12,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:18:13,263][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:18:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:18:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:18:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:18:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:18:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:18:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:18:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:18:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:18:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:18:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:18:20,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:18:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:18:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:18:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:18:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:18:23,815][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:24,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:25,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:25,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:26,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:27,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:27,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:28,429][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:31,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:31,724][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:18:32,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:18:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:18:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:18:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:37,656][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:38,315][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:38,974][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:40,292][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:18:40,951][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:18:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:18:42,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:18:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:18:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:18:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:18:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:18:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:18:46,553][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:18:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:18:47,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:18:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:51,167][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:18:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:18:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:18:53,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:18:54,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:18:55,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:55,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:55,736][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:57,046][__main__][INFO] - Iteration 130 took 52s (10.26% Gen, 87.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 39m 21s. Estimated total time: 14h 36m 54s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 41s, 500 more iterations: 7h 18m 27s. [2026-03-25 16:18:57,048][__main__][INFO] - Starting iteration 130. [2026-03-25 16:18:57,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:18:57,052][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:02,000][__main__][INFO] - Number of regex retries in iteration 130: 0 [2026-03-25 16:19:02,001][__main__][INFO] - agents played in iteration 130 are Alice, Bob [2026-03-25 16:19:02,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:02,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:02,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:19:02,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:19:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:04,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:04,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:06,033][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:19:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:19:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:19:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:19:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:19:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:19:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:19:10,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:19:11,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:19:11,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:19:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:19:13,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:19:13,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:19:14,598][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:19:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:19:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:19:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:19:17,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:19:17,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:19:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:19:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:19:19,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:19:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:19:21,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:19:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:19:22,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:19:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:19:23,834][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:19:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:19:25,152][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:19:25,811][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:19:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:30,427][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:31,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:34,381][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:36,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:37,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:19:38,016][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:19:38,676][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:19:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:19:39,993][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:19:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:19:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:19:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:19:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:19:43,287][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:19:43,946][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:45,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:19:46,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:19:48,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:48,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:48,132][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:49,411][__main__][INFO] - Iteration 131 took 52s (9.45% Gen, 88.10% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 34m 16s. Estimated total time: 14h 32m 41s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 20s. [2026-03-25 16:19:49,415][__main__][INFO] - Starting iteration 131. [2026-03-25 16:19:49,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:19:49,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:55,430][__main__][INFO] - Number of regex retries in iteration 131: 0 [2026-03-25 16:19:55,431][__main__][INFO] - agents played in iteration 131 are Alice, Bob [2026-03-25 16:19:55,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:55,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:55,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:19:55,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:19:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:58,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:59,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:01,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:03,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:20:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:06,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:20:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:20:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:20:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:20:09,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:20:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:20:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:20:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:20:12,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:20:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:20:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:20:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:20:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:20:15,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:20:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:20:17,225][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:20:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:20:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:20:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:20:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:20:20,517][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:20:21,175][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:20:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:20:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:20:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:20:23,809][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:20:24,469][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:20:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:20:25,786][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:20:26,447][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:20:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:20:27,764][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:20:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:20:29,416][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:20:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:20:31,392][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:33,370][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:34,031][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:36,666][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:37,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:39,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:20:40,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:20:41,350][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:20:41,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:20:41,355][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:20:42,748][__main__][INFO] - Iteration 132 took 53s (11.27% Gen, 86.11% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 49m 32s. Estimated total time: 14h 48m 51s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 53s, 500 more iterations: 7h 24m 25s. [2026-03-25 16:20:42,752][__main__][INFO] - Starting iteration 132. [2026-03-25 16:20:42,756][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:20:42,756][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:47,624][__main__][INFO] - Number of regex retries in iteration 132: 0 [2026-03-25 16:20:47,626][__main__][INFO] - agents played in iteration 132 are Alice, Bob [2026-03-25 16:20:48,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:48,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:48,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:20:48,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:20:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:20:49,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:20:50,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:20:50,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:20:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:53,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:54,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:55,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:20:56,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:56,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:58,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:00,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:02,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:06,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:21:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:21:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:21:09,353][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:21:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:21:10,671][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:21:11,332][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:21:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:21:12,652][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:21:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:21:13,971][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:21:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:21:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:21:15,951][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:21:16,611][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:21:17,270][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:21:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:21:18,590][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:21:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:21:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:21:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:21:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:21:22,214][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:21:22,876][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:21:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:21:24,195][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:21:24,856][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:21:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:26,834][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:21:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:21:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:21:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:21:31,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:21:32,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:21:33,393][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:33,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:33,398][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:34,631][__main__][INFO] - Iteration 133 took 51s (9.39% Gen, 88.23% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 24m 27s. Estimated total time: 14h 24m 37s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 18s. [2026-03-25 16:21:34,634][__main__][INFO] - Starting iteration 133. [2026-03-25 16:21:34,637][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:21:34,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:39,507][__main__][INFO] - Number of regex retries in iteration 133: 0 [2026-03-25 16:21:39,509][__main__][INFO] - agents played in iteration 133 are Alice, Bob [2026-03-25 16:21:40,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:21:40,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:21:40,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:21:40,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:21:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:21:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:21:42,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:21:42,906][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:21:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:21:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:21:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:21:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:21:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:21:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:21:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:21:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:21:48,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:21:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:21:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:21:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:21:51,472][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:52,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:52,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:55,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:58,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:01,366][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:22:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:04,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:06,643][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:22:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:22:08,621][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:22:09,280][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:22:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:22:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:22:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:22:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:22:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:22:13,568][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:22:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:22:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:22:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:22:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:22:16,868][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:22:17,528][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:22:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:22:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:22:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:22:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:22:20,827][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:21,487][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:22,146][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:23,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:22:24,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:22:25,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:22:25,659][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:22:25,660][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:22:27,058][__main__][INFO] - Iteration 134 took 52s (9.29% Gen, 88.04% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 32m 39s. Estimated total time: 14h 33m 42s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 51s. [2026-03-25 16:22:27,063][__main__][INFO] - Starting iteration 134. [2026-03-25 16:22:27,066][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:22:27,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:22:32,128][__main__][INFO] - Number of regex retries in iteration 134: 0 [2026-03-25 16:22:32,130][__main__][INFO] - agents played in iteration 134 are Alice, Bob [2026-03-25 16:22:32,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:22:32,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:22:32,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:22:32,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:22:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:35,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:38,172][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:39,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:40,150][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:22:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:42,126][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:22:42,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:22:43,444][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:22:44,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:22:44,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:22:45,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:22:46,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:22:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:22:47,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:22:48,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:22:48,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:22:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:22:50,035][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:22:50,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:22:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:22:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:53,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:53,992][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:22:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:55,971][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:56,631][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:57,291][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:03,224][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:06,196][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:06,856][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:23:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:23:08,835][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:23:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:23:10,155][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:23:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:23:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:23:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:23:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:23:13,453][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:23:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:23:14,773][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:23:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:23:16,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:23:16,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:23:17,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:17,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:17,970][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:19,303][__main__][INFO] - Iteration 135 took 52s (9.69% Gen, 87.75% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 28m 43s. Estimated total time: 14h 30m 38s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 19s. [2026-03-25 16:23:19,306][__main__][INFO] - Starting iteration 135. [2026-03-25 16:23:19,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:23:19,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:24,241][__main__][INFO] - Number of regex retries in iteration 135: 0 [2026-03-25 16:23:24,242][__main__][INFO] - agents played in iteration 135 are Alice, Bob [2026-03-25 16:23:24,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:24,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:24,918][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:23:24,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:23:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:26,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:27,021][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:29,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:23:30,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:23:31,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:23:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:23:32,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:23:33,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:23:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:35,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:36,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:38,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:44,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:23:45,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:23:46,145][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:23:46,805][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:23:47,465][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:23:48,126][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:23:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:23:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:23:50,103][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:23:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:23:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:23:52,080][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:54,720][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:56,040][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:59,020][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:24:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:24:08,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:24:09,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:24:10,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:24:10,172][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:24:10,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:24:11,577][__main__][INFO] - Iteration 136 took 52s (9.44% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 28m 21s. Estimated total time: 14h 31m 9s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 34s. [2026-03-25 16:24:11,580][__main__][INFO] - Starting iteration 136. [2026-03-25 16:24:11,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:24:11,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:24:16,433][__main__][INFO] - Number of regex retries in iteration 136: 0 [2026-03-25 16:24:16,434][__main__][INFO] - agents played in iteration 136 are Alice, Bob [2026-03-25 16:24:16,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:24:17,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:24:17,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:24:17,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:24:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:24:18,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:24:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:24:19,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:24:20,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:24:21,155][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:24:21,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:24:22,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:24:23,134][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:24:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:24:24,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:24:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:24:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:24:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:24:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:24:27,762][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:24:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:24:29,083][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:24:29,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:24:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:24:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:24:31,724][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:24:32,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:24:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:24:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:24:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:24:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:24:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:24:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:24:37,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:38,322][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:24:38,982][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:39,641][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:40,301][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:41,619][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:42,278][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:42,940][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:45,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:24:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:24:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:24:48,221][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:24:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:24:49,869][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:24:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:24:51,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:24:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:24:52,510][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:55,807][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:57,130][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:57,790][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:00,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:25:01,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:25:02,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:02,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:02,330][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:03,780][__main__][INFO] - Iteration 137 took 52s (9.29% Gen, 87.93% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 26m 17s. Estimated total time: 14h 29m 57s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 59s, 500 more iterations: 7h 14m 58s. [2026-03-25 16:25:03,782][__main__][INFO] - Starting iteration 137. [2026-03-25 16:25:03,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:25:03,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:25:08,786][__main__][INFO] - Number of regex retries in iteration 137: 0 [2026-03-25 16:25:08,788][__main__][INFO] - agents played in iteration 137 are Alice, Bob [2026-03-25 16:25:09,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:25:09,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:25:09,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:25:09,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:25:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:25:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:25:11,550][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:25:12,210][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:25:12,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:25:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:25:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:25:14,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:25:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:25:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:25:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:25:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:25:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:25:18,802][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:25:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:25:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:25:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:25:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:25:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:25:22,755][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:25:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:25:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:25:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:25:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:25:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:25:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:25:27,369][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:25:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:25:28,689][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:25:29,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:25:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:25:30,665][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:25:31,324][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:25:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:25:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:25:33,301][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:25:33,960][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:25:34,618][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:25:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:25:35,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:25:36,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:25:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:25:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:25:38,574][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:39,893][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:41,211][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:25:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:47,480][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:48,140][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:25:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:25:50,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:25:50,779][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:25:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:25:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:52,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:25:53,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:25:54,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:54,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:54,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:56,282][__main__][INFO] - Iteration 138 took 52s (9.53% Gen, 87.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 30m 26s. Estimated total time: 14h 34m 58s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 29s. [2026-03-25 16:25:56,286][__main__][INFO] - Starting iteration 138. [2026-03-25 16:25:56,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:25:56,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:09,965][__main__][INFO] - Number of regex retries in iteration 138: 0 [2026-03-25 16:26:09,967][__main__][INFO] - agents played in iteration 138 are Alice, Bob [2026-03-25 16:26:10,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:26:10,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:26:10,600][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:26:10,600][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:26:11,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:13,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:14,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:26:14,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:26:15,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:26:16,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:26:16,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:26:17,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:26:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:26:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:26:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:26:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:26:20,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:26:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:26:22,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:26:22,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:26:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:26:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:26:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:26:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:26:25,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:26:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:26:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:26:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:26:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:26:29,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:26:29,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:26:30,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:26:31,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:26:31,888][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:26:32,546][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:26:33,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:26:33,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:26:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:26:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:26:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:26:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:26:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:26:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:26:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:26:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:26:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:26:40,451][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:26:41,109][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:26:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:26:42,427][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:26:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:26:44,083][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:26:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:26:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:26:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:26:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:26:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:26:48,038][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:26:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:26:49,357][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:26:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:50,675][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:51,333][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:53,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:26:54,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:26:55,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:55,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:55,905][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:57,649][__main__][INFO] - Iteration 139 took 1m 1s (22.29% Gen, 74.86% Train). Generation: 13s, Training: 45s. Estimated remaining time: 14h 57m 8s. Estimated total time: 17h 2m 41s. Time estimates for 10 more iterations: 10m 13s, 100 more iterations: 1h 42m 16s, 500 more iterations: 8h 31m 20s. [2026-03-25 16:26:57,651][__main__][INFO] - Starting iteration 139. [2026-03-25 16:26:57,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:26:57,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:27:03,780][__main__][INFO] - Number of regex retries in iteration 139: 0 [2026-03-25 16:27:03,781][__main__][INFO] - agents played in iteration 139 are Alice, Bob [2026-03-25 16:27:04,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:27:04,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:27:04,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:27:04,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:27:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:27:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:27:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:27:07,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:27:07,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:08,467][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:27:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:27:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:27:11,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:27:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:27:12,420][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:27:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:27:13,737][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:27:14,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:27:15,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:27:15,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:27:16,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:27:17,029][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:27:17,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:27:18,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:27:19,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:27:19,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:27:20,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:27:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:27:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:27:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:27:22,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:27:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:27:24,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:27:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:27:25,606][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:27:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:27:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:27:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:27:28,244][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:27:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:27:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:27:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:27:30,882][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:27:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:27:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:27:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:27:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:27:34,178][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:27:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:27:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:27:36,154][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:27:37,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:27:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:27:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:27:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:27:39,810][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:27:40,469][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:27:41,128][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:27:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:27:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:27:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:27:43,768][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:27:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:27:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:27:45,748][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:27:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:27:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:27:47,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:27:48,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:28:00,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:00,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:00,993][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:28:02,352][__main__][INFO] - Iteration 140 took 1m 4s (9.47% Gen, 88.43% Train). Generation: 6s, Training: 57s. Estimated remaining time: 15h 51m 40s. Estimated total time: 17h 58m 18s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 49s, 500 more iterations: 8h 59m 9s. [2026-03-25 16:28:02,355][__main__][INFO] - Starting iteration 140. [2026-03-25 16:28:02,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:28:02,360][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:09,026][__main__][INFO] - Number of regex retries in iteration 140: 0 [2026-03-25 16:28:09,027][__main__][INFO] - agents played in iteration 140 are Alice, Bob [2026-03-25 16:28:09,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:28:09,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:28:09,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:28:09,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:28:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:28:11,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:28:11,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:28:12,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:28:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:28:13,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:28:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:17,095][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:28:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:20,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:21,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:21,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:22,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:23,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:25,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:28:25,661][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:28:26,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:28:26,979][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:28:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:28:28,296][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:28:28,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:28:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:28:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:28:30,931][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:28:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:28:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:28:32,907][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:28:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:28:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:28:34,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:28:35,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:28:36,203][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:28:36,862][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:28:37,522][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:28:38,184][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:28:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:28:39,509][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:28:40,169][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:28:40,834][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:28:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:28:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:28:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:28:43,822][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:28:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:28:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:28:45,803][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:28:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:28:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:28:47,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:28:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:28:49,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:28:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:28:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:28:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:28:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:28:52,401][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:28:53,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:28:53,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:28:54,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:54,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:54,997][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:28:58,355][__main__][INFO] - Iteration 141 took 55s (11.91% Gen, 82.09% Train). Generation: 6s, Training: 45s. Estimated remaining time: 13h 25m 44s. Estimated total time: 15h 33m 18s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 39s. [2026-03-25 16:28:58,358][__main__][INFO] - Starting iteration 141. [2026-03-25 16:28:58,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:28:58,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:03,367][__main__][INFO] - Number of regex retries in iteration 141: 0 [2026-03-25 16:29:03,368][__main__][INFO] - agents played in iteration 141 are Alice, Bob [2026-03-25 16:29:03,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:04,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:04,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:29:04,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:29:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:29:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:29:08,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:29:09,301][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:29:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:29:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:29:11,279][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:29:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:29:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:29:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:29:13,912][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:29:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:29:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:29:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:29:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:29:17,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:29:17,865][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:29:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:29:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:19,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:20,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:23,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:25,123][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:29:25,782][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:27,762][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:28,423][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:29,083][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:29:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:29:31,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:29:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:29:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:29:33,038][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:29:33,698][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:29:34,356][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:29:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:29:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:29:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:29:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:29:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:29:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:29:39,304][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:29:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:29:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:29:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:29:41,941][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:29:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:29:43,259][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:29:43,918][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:29:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:29:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:29:45,895][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:29:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:29:47,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:29:48,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:29:49,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:29:49,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:29:49,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:50,545][__main__][INFO] - Iteration 142 took 52s (9.59% Gen, 87.83% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 21m 18s. Estimated total time: 14h 29m 45s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 52s. [2026-03-25 16:29:50,548][__main__][INFO] - Starting iteration 142. [2026-03-25 16:29:50,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:29:50,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:55,802][__main__][INFO] - Number of regex retries in iteration 142: 0 [2026-03-25 16:29:55,803][__main__][INFO] - agents played in iteration 142 are Alice, Bob [2026-03-25 16:29:56,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:56,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:56,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:29:56,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:29:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:58,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:59,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:59,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:00,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:30:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:04,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:05,647][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:06,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:06,965][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:30:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:30:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:30:08,942][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:30:09,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:30:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:30:10,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:30:11,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:30:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:30:12,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:30:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:30:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:30:14,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:30:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:30:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:30:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:30:17,513][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:30:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:30:18,831][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:30:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:30:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:30:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:30:21,470][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:30:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:23,448][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:24,109][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:24,768][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:25,427][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:28,067][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:29,728][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:30,389][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:31,048][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:30:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:32,368][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:30:33,029][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:30:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:30:34,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:30:35,008][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:30:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:30:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:30:36,989][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:30:37,649][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:30:38,315][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:30:38,973][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:30:39,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:30:40,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:30:41,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:30:41,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:30:41,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:30:42,944][__main__][INFO] - Iteration 143 took 52s (10.02% Gen, 87.53% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 23m 55s. Estimated total time: 14h 33m 14s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 37s. [2026-03-25 16:30:42,946][__main__][INFO] - Starting iteration 143. [2026-03-25 16:30:42,950][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:30:42,951][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:48,250][__main__][INFO] - Number of regex retries in iteration 143: 0 [2026-03-25 16:30:48,251][__main__][INFO] - agents played in iteration 143 are Alice, Bob [2026-03-25 16:30:48,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:48,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:48,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:30:48,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:30:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:50,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:50,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:55,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:30:56,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:59,492][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:02,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:04,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:05,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:06,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:06,751][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:07,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:31:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:31:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:31:10,052][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:31:10,711][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:31:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:31:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:31:12,690][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:31:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:31:14,010][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:31:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:31:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:31:15,990][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:31:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:31:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:31:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:31:18,642][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:31:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:31:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:31:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:31:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:31:22,280][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:31:22,942][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:31:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:31:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:31:24,926][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:26,249][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:32,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:31:33,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:31:34,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:34,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:34,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:31:35,408][__main__][INFO] - Iteration 144 took 52s (10.10% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 24m 8s. Estimated total time: 14h 34m 19s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 9s. [2026-03-25 16:31:35,410][__main__][INFO] - Starting iteration 144. [2026-03-25 16:31:35,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:31:35,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:31:40,140][__main__][INFO] - Number of regex retries in iteration 144: 0 [2026-03-25 16:31:40,141][__main__][INFO] - agents played in iteration 144 are Alice, Bob [2026-03-25 16:31:40,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:31:40,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:31:40,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:31:40,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:31:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:31:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:31:42,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:31:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:31:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:44,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:45,461][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:47,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:48,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:31:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:53,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:54,046][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:54,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:55,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:56,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:58,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:00,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:01,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:01,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:32:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:03,284][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:03,944][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:05,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:06,582][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:07,241][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:32:07,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:32:08,558][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:32:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:32:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:32:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:32:11,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:32:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:32:12,514][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:32:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:32:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:32:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:32:15,488][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:32:16,149][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:32:16,808][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:32:17,469][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:32:18,128][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:32:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:32:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:32:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:32:20,765][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:32:21,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:32:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:32:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:32:23,402][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:32:24,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:32:24,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:32:26,109][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:32:26,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:32:26,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:27,423][__main__][INFO] - Iteration 145 took 52s (9.09% Gen, 88.41% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 15m 47s. Estimated total time: 14h 26m 51s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 25s. [2026-03-25 16:32:27,426][__main__][INFO] - Starting iteration 145. [2026-03-25 16:32:27,430][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:32:27,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:32,226][__main__][INFO] - Number of regex retries in iteration 145: 0 [2026-03-25 16:32:32,227][__main__][INFO] - agents played in iteration 145 are Alice, Bob [2026-03-25 16:32:32,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:32:32,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:32:32,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:32:32,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:32:33,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:32:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:32:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:32:35,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:32:36,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:32:36,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:32:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:32:38,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:32:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:32:39,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:32:40,280][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:32:40,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:32:41,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:32:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:32:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:44,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:44,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:46,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:47,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:48,187][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:48,846][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:49,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:32:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:32:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:32:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:54,118][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:32:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:55,435][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:59,394][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:00,053][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:02,029][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:04,005][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:06,324][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:07,642][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:33:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:33:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:33:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:33:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:33:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:33:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:33:12,256][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:33:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:33:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:33:14,232][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:33:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:33:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:33:16,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:33:17,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:33:18,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:33:18,145][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:33:18,147][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:33:19,436][__main__][INFO] - Iteration 146 took 52s (9.22% Gen, 88.29% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 14m 52s. Estimated total time: 14h 26m 47s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 23s. [2026-03-25 16:33:19,441][__main__][INFO] - Starting iteration 146. [2026-03-25 16:33:19,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:33:19,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:33:24,480][__main__][INFO] - Number of regex retries in iteration 146: 0 [2026-03-25 16:33:24,481][__main__][INFO] - agents played in iteration 146 are Alice, Bob [2026-03-25 16:33:25,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:33:25,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:33:25,132][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:33:25,133][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:33:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:26,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:27,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:29,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:30,535][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:31,852][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:33:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:33,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:34,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:35,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:35,803][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:33:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:33:37,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:33:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:39,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:39,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:40,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:33:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:33:41,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:33:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:44,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:45,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:33:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:33:46,338][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:33:46,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:33:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:33:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:33:48,973][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:33:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:33:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:33:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:33:51,607][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:52,266][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:52,926][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:53,584][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:54,243][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:54,902][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:55,561][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:56,220][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:58,539][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:59,858][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:34:00,516][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:03,156][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:03,815][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:07,769][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:34:08,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:34:09,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:34:10,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:34:10,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:34:10,253][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:34:11,563][__main__][INFO] - Iteration 147 took 52s (9.61% Gen, 87.87% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 15m 22s. Estimated total time: 14h 28m 10s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 5s. [2026-03-25 16:34:11,565][__main__][INFO] - Starting iteration 147. [2026-03-25 16:34:11,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:34:11,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:34:16,363][__main__][INFO] - Number of regex retries in iteration 147: 0 [2026-03-25 16:34:16,364][__main__][INFO] - agents played in iteration 147 are Alice, Bob [2026-03-25 16:34:16,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:34:16,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:34:16,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:34:16,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:34:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:34:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:34:19,075][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:34:19,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:34:20,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:34:21,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:34:21,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:34:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:34:23,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:34:23,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:34:24,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:34:25,012][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:34:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:34:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:34:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:34:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:28,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:28,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:30,283][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:34:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:34:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:33,577][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:35,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:36,210][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:34:36,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:38,186][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:34:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:34:39,505][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:34:40,164][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:34:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:34:41,482][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:34:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:34:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:34:43,459][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:34:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:34:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:34:45,437][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:34:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:34:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:34:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:34:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:34:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:34:49,731][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:34:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:34:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:34:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:34:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:00,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:35:01,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:35:02,195][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:02,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:02,203][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:03,547][__main__][INFO] - Iteration 148 took 51s (9.22% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 12m 38s. Estimated total time: 14h 26m 17s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 8s. [2026-03-25 16:35:03,550][__main__][INFO] - Starting iteration 148. [2026-03-25 16:35:03,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:35:03,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:35:08,600][__main__][INFO] - Number of regex retries in iteration 148: 0 [2026-03-25 16:35:08,601][__main__][INFO] - agents played in iteration 148 are Alice, Bob [2026-03-25 16:35:09,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:35:09,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:35:09,242][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:35:09,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:35:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:35:10,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:35:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:35:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:35:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:35:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:35:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:35:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:35:15,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:35:15,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:35:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:35:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:35:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:35:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:35:19,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:35:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:35:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:35:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:35:21,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:35:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:35:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:35:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:35:24,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:35:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:35:25,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:35:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:35:27,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:35:27,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:35:28,381][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:35:29,040][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:35:29,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:35:30,358][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:35:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:33,654][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:34,314][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:36,950][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:38,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:39,587][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:40,246][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:35:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:35:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:35:43,212][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:35:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:35:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:35:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:35:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:35:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:35:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:35:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:35:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:35:49,145][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:35:49,805][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:35:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:35:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:35:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:52,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:35:53,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:35:54,477][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:54,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:54,508][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:56,105][__main__][INFO] - Iteration 149 took 52s (9.60% Gen, 87.35% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 21m 20s. Estimated total time: 14h 35m 53s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 56s. [2026-03-25 16:35:56,107][__main__][INFO] - Starting iteration 149. [2026-03-25 16:35:56,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:35:56,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:01,053][__main__][INFO] - Number of regex retries in iteration 149: 0 [2026-03-25 16:36:01,054][__main__][INFO] - agents played in iteration 149 are Alice, Bob [2026-03-25 16:36:01,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:01,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:01,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:36:01,587][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:36:02,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:02,932][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:03,593][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:05,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:06,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:36:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:36:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:36:09,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:36:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:36:10,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:36:11,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:36:12,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:36:12,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:13,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:14,144][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:14,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:18,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:36:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:36:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:36:22,714][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:36:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:36:24,031][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:36:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:36:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:36:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:36:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:36:27,331][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:36:27,990][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:36:28,649][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:36:29,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:36:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:36:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:36:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:36:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:36:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:36:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:34,257][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:36:36,895][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:37,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:40,194][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:40,854][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:41,514][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:42,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:43,494][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:36:44,156][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:36:44,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:36:45,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:36:46,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:36:46,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:36:46,703][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:36:48,059][__main__][INFO] - Iteration 150 took 51s (9.51% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 10m 25s. Estimated total time: 14h 25m 49s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 54s. [2026-03-25 16:36:48,061][__main__][INFO] - Starting iteration 150. [2026-03-25 16:36:48,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:36:48,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:52,940][__main__][INFO] - Number of regex retries in iteration 150: 0 [2026-03-25 16:36:52,941][__main__][INFO] - agents played in iteration 150 are Alice, Bob [2026-03-25 16:36:53,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:53,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:53,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:36:53,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:36:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:57,611][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:58,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:00,250][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:00,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:37:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:02,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:02,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:04,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:37:06,845][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:37:07,504][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:37:08,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:37:08,824][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:37:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:37:10,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:14,099][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:14,758][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:37:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:16,077][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:37:17,394][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:37:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:37:18,714][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:37:19,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:37:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:37:20,692][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:37:21,350][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:37:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:37:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:37:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:37:23,989][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:37:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:37:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:37:26,311][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:37:26,971][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:37:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:37:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:37:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:37:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:37:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:37:30,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:37:31,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:37:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:37:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:37:33,571][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:37:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:37:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:37:35,553][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:36,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:36,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:37:37,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:37:38,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:38,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:38,774][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:41,810][__main__][INFO] - Iteration 151 took 53s (9.07% Gen, 85.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 39m 28s. Estimated total time: 14h 55m 45s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 52s. [2026-03-25 16:37:41,812][__main__][INFO] - Starting iteration 151. [2026-03-25 16:37:41,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:37:41,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:37:46,663][__main__][INFO] - Number of regex retries in iteration 151: 0 [2026-03-25 16:37:46,665][__main__][INFO] - agents played in iteration 151 are Alice, Bob [2026-03-25 16:37:47,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:37:47,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:37:47,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:37:47,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:37:48,104][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:37:48,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:37:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:37:50,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:37:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:37:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:37:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:37:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:37:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:54,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:54,678][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:37:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:58,650][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:59,309][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:02,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:03,276][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:03,937][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:38:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:38:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:38:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:38:08,555][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:38:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:38:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:38:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:11,857][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:38:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:38:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:38:13,838][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:38:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:38:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:38:15,818][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:38:16,478][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:38:17,138][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:38:17,799][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:38:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:38:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:38:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:38:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:38:21,435][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:38:22,101][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:38:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:38:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:38:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:38:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:38:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:38:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:38:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:38:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:38:28,040][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:38:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:38:29,359][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:38:30,019][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:38:30,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:38:31,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:38:32,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:38:32,538][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:38:32,539][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:38:33,782][__main__][INFO] - Iteration 152 took 51s (9.33% Gen, 88.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 8m 57s. Estimated total time: 14h 26m 7s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 3s. [2026-03-25 16:38:33,784][__main__][INFO] - Starting iteration 152. [2026-03-25 16:38:33,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:38:33,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:39,113][__main__][INFO] - Number of regex retries in iteration 152: 0 [2026-03-25 16:38:39,114][__main__][INFO] - agents played in iteration 152 are Alice, Bob [2026-03-25 16:38:39,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:38:39,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:38:39,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:38:39,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:38:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:43,055][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:45,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:46,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:47,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:38:47,671][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:38:48,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:38:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:38:49,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:38:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:38:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:38:51,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:38:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:53,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:57,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:38:58,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:38:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:00,860][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:39:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:06,132][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:06,791][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:07,449][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:39:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:39:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:39:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:39:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:39:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:39:11,407][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:39:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:39:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:39:13,715][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:39:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:39:15,032][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:39:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:39:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:39:17,012][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:39:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:39:18,330][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:39:18,989][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:39:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:39:20,311][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:39:20,969][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:39:21,630][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:39:22,290][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:39:22,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:39:23,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:39:24,879][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:39:24,882][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:39:24,884][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:39:26,319][__main__][INFO] - Iteration 153 took 52s (10.13% Gen, 87.13% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 17m 30s. Estimated total time: 14h 35m 32s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 33s, 500 more iterations: 7h 17m 46s. [2026-03-25 16:39:26,322][__main__][INFO] - Starting iteration 153. [2026-03-25 16:39:26,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:39:26,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:39:31,124][__main__][INFO] - Number of regex retries in iteration 153: 0 [2026-03-25 16:39:31,125][__main__][INFO] - agents played in iteration 153 are Alice, Bob [2026-03-25 16:39:31,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:39:31,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:39:31,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:39:31,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:39:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:39:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:39:33,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:39:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:39:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:39:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:39:36,551][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:39:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:39:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:39:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:39:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:39:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:40,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:43,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:46,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:39:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:39:47,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:39:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:39:49,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:39:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:39:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:39:51,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:39:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:52,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:53,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:39:53,689][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:55,664][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:56,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:00,935][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:02,913][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:05,884][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:06,543][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:40:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:07,861][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:40:08,520][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:40:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:40:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:40:10,500][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:40:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:40:11,818][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:40:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:40:13,138][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:40:13,798][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:40:14,457][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:40:15,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:40:15,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:40:17,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:40:17,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:40:17,104][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:40:18,357][__main__][INFO] - Iteration 154 took 52s (9.22% Gen, 88.37% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 8m 18s. Estimated total time: 14h 27m 12s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 36s. [2026-03-25 16:40:18,359][__main__][INFO] - Starting iteration 154. [2026-03-25 16:40:18,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:40:18,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:40:23,269][__main__][INFO] - Number of regex retries in iteration 154: 0 [2026-03-25 16:40:23,270][__main__][INFO] - agents played in iteration 154 are Alice, Bob [2026-03-25 16:40:23,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:40:23,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:40:23,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:40:23,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:40:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:40:25,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:40:25,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:40:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:40:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:40:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:40:28,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:40:29,125][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:40:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:40:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:40:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:40:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:40:32,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:40:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:40:33,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:40:34,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:40:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:40:35,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:40:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:40:37,041][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:40:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:40:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:39,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:44,958][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:40:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:46,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:50,892][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:51,551][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:52,869][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:53,528][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:54,846][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:56,503][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:40:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:01,119][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:01,778][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:02,438][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:03,099][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:05,078][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:06,397][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:07,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:41:07,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:41:08,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:41:08,892][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:41:08,893][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:41:10,241][__main__][INFO] - Iteration 155 took 51s (9.46% Gen, 87.94% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 4m 54s. Estimated total time: 14h 24m 40s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 20s. [2026-03-25 16:41:10,244][__main__][INFO] - Starting iteration 155. [2026-03-25 16:41:10,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:41:10,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:41:14,941][__main__][INFO] - Number of regex retries in iteration 155: 0 [2026-03-25 16:41:14,942][__main__][INFO] - agents played in iteration 155 are Alice, Bob [2026-03-25 16:41:15,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:41:15,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:41:15,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:41:15,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:41:16,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:41:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:41:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:41:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:41:18,953][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:41:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:41:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:41:20,931][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:41:21,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:41:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:41:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:41:23,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:41:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:41:24,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:41:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:41:26,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:41:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:41:27,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:41:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:41:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:41:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:41:30,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:41:30,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:41:31,477][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:41:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:41:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:41:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:36,750][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:41:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:38,068][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:41:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:41:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:41:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:41:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:41:43,343][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:45,320][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:47,297][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:48,298][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:48,959][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:49,618][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:50,277][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:41:50,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:41:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:54,898][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:57,537][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:58,196][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:58,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:41:59,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:42:00,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:00,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:00,755][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:02,295][__main__][INFO] - Iteration 156 took 52s (9.07% Gen, 87.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 6m 50s. Estimated total time: 14h 27m 28s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 44s. [2026-03-25 16:42:02,298][__main__][INFO] - Starting iteration 156. [2026-03-25 16:42:02,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:42:02,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:42:07,098][__main__][INFO] - Number of regex retries in iteration 156: 0 [2026-03-25 16:42:07,099][__main__][INFO] - agents played in iteration 156 are Alice, Bob [2026-03-25 16:42:07,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:42:07,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:42:07,653][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:42:07,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:42:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:42:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:42:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:42:10,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:42:11,042][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:42:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:42:12,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:42:13,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:42:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:42:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:42:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:42:15,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:42:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:42:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:42:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:42:18,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:42:18,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:42:19,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:42:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:42:20,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:42:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:42:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:42:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:42:23,577][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:42:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:42:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:42:25,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:42:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:42:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:42:27,534][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:42:28,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:42:28,853][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:42:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:42:30,172][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:42:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:42:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:42:32,150][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:34,131][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:36,777][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:42:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:42:38,758][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:42:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:42:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:42:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:42:44,382][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:42:45,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:42:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:46,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:47,019][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:49,662][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:50,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:42:51,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:42:52,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:52,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:52,892][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:54,310][__main__][INFO] - Iteration 157 took 52s (9.22% Gen, 88.05% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 5m 18s. Estimated total time: 14h 26m 49s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 24s. [2026-03-25 16:42:54,314][__main__][INFO] - Starting iteration 157. [2026-03-25 16:42:54,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:42:54,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:00,591][__main__][INFO] - Number of regex retries in iteration 157: 0 [2026-03-25 16:43:00,593][__main__][INFO] - agents played in iteration 157 are Alice, Bob [2026-03-25 16:43:01,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:01,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:01,399][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:43:01,400][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:43:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:03,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:04,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:05,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:43:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:43:08,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:43:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:43:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:43:10,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:43:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:43:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:43:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:43:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:43:13,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:43:14,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:43:14,686][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:43:15,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:43:16,005][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:43:16,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:43:17,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:43:17,984][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:43:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:43:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:43:19,963][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:43:20,622][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:43:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:43:21,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:43:22,603][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:43:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:43:23,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:43:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:43:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:43:25,907][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:43:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:43:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:43:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:43:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:43:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:43:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:43:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:43:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:31,853][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:32,513][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:43:33,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:43:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:43:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:43:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:43:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:37,478][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:38,137][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:38,798][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:43:40,117][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:43:40,778][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:43:41,438][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:43:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:43:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:43:43,421][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:43:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:43:44,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:43:45,549][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:43:46,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:43:46,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:43:46,678][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:43:48,043][__main__][INFO] - Iteration 158 took 53s (11.68% Gen, 85.78% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 33m 2s. Estimated total time: 14h 55m 26s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 32s, 500 more iterations: 7h 27m 43s. [2026-03-25 16:43:48,046][__main__][INFO] - Starting iteration 158. [2026-03-25 16:43:48,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:43:48,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:53,750][__main__][INFO] - Number of regex retries in iteration 158: 0 [2026-03-25 16:43:53,752][__main__][INFO] - agents played in iteration 158 are Alice, Bob [2026-03-25 16:43:54,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:54,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:54,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:43:54,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:43:55,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:57,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:57,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:59,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:01,053][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:44:02,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:03,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:05,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:06,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:07,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:44:08,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:44:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:44:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:44:10,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:44:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:44:11,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:44:12,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:44:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:44:13,595][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:44:14,254][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:44:14,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:44:15,575][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:44:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:44:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:44:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:44:18,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:44:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:44:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:44:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:44:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:44:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:44:22,180][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:44:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:44:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:44:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:44:24,820][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:44:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:27,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:28,463][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:29,125][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:44:29,785][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:35,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:44:36,390][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:44:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:44:37,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:44:38,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:44:39,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:44:39,691][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:44:39,692][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:44:41,094][__main__][INFO] - Iteration 159 took 53s (10.75% Gen, 86.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 20m 49s. Estimated total time: 14h 44m 6s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 3s. [2026-03-25 16:44:41,097][__main__][INFO] - Starting iteration 159. [2026-03-25 16:44:41,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:44:41,101][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:44:45,919][__main__][INFO] - Number of regex retries in iteration 159: 0 [2026-03-25 16:44:45,920][__main__][INFO] - agents played in iteration 159 are Alice, Bob [2026-03-25 16:44:46,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:44:46,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:44:46,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:44:46,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:44:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:44:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:44:48,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:49,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:51,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:51,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:52,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:44:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:55,084][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:57,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:57,726][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:58,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:00,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:01,686][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:05,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:06,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:07,628][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:45:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:45:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:45:10,271][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:45:10,932][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:45:11,593][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:45:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:45:12,913][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:45:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:45:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:45:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:45:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:45:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:45:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:45:17,535][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:45:18,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:45:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:45:19,854][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:45:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:45:21,174][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:45:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:45:22,494][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:45:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:45:23,813][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:45:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:25,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:25,794][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:29,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:45:30,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:45:31,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:31,742][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:31,743][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:45:33,181][__main__][INFO] - Iteration 160 took 52s (9.25% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 3m 52s. Estimated total time: 14h 28m 1s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 0s. [2026-03-25 16:45:33,184][__main__][INFO] - Starting iteration 160. [2026-03-25 16:45:33,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:45:33,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:45:38,243][__main__][INFO] - Number of regex retries in iteration 160: 0 [2026-03-25 16:45:38,244][__main__][INFO] - agents played in iteration 160 are Alice, Bob [2026-03-25 16:45:38,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:45:38,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:45:38,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:45:38,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:45:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:45:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:45:40,871][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:45:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:45:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:45:42,855][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:45:43,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:45:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:45:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:45:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:45:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:45:46,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:45:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:45:48,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:45:48,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:45:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:45:50,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:45:50,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:55,401][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:58,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:00,036][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:46:00,696][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:01,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:03,995][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:07,294][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:09,275][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:10,594][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:11,611][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:12,271][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:46:12,930][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:46:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:46:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:46:14,908][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:46:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:46:16,228][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:46:16,887][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:46:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:46:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:46:18,866][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:46:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:46:20,185][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:46:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:22,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:46:22,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:46:24,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:24,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:24,029][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:25,459][__main__][INFO] - Iteration 161 took 52s (9.67% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 6m 10s. Estimated total time: 14h 31m 12s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 36s. [2026-03-25 16:46:25,461][__main__][INFO] - Starting iteration 161. [2026-03-25 16:46:25,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:46:25,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:46:30,261][__main__][INFO] - Number of regex retries in iteration 161: 0 [2026-03-25 16:46:30,263][__main__][INFO] - agents played in iteration 161 are Alice, Bob [2026-03-25 16:46:30,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:46:30,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:46:30,975][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:46:30,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:46:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:46:32,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:46:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:46:33,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:46:34,295][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:46:34,954][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:46:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:46:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:46:36,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:46:37,596][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:46:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:46:38,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:46:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:46:40,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:46:40,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:46:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:46:42,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:46:42,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:46:43,533][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:46:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:46:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:46:45,512][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:46:46,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:46:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:46:47,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:46:48,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:46:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:46:49,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:46:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:46:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:46:51,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:52,208][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:46:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:55,508][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:56,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:58,803][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:01,441][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:04,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:47:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:08,418][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:10,396][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:11,717][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:13,038][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:47:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:47:14,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:47:15,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:47:16,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:47:16,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:47:16,313][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:17,774][__main__][INFO] - Iteration 162 took 52s (9.17% Gen, 88.03% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 5m 56s. Estimated total time: 14h 31m 50s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 55s. [2026-03-25 16:47:17,782][__main__][INFO] - Starting iteration 162. [2026-03-25 16:47:17,796][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:47:17,797][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:23,038][__main__][INFO] - Number of regex retries in iteration 162: 0 [2026-03-25 16:47:23,039][__main__][INFO] - agents played in iteration 162 are Alice, Bob [2026-03-25 16:47:23,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:47:23,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:47:23,672][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:47:23,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:47:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:47:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:47:25,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:47:26,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:47:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:47:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:47:28,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:47:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:47:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:47:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:47:30,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:47:31,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:47:32,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:47:32,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:47:33,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:47:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:47:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:47:35,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:47:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:47:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:47:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:47:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:47:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:47:39,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:47:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:47:40,876][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:47:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:47:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:47:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:47:43,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:47:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:47:44,835][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:47:45,494][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:47:46,154][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:47:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:47:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:47:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:47:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:47:49,452][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:47:50,110][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:47:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:47:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:47:52,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:55,394][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:47:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:00,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:03,647][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:04,307][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:05,630][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:06,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:48:07,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:48:08,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:08,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:08,869][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:10,387][__main__][INFO] - Iteration 163 took 52s (9.97% Gen, 87.14% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 9m 46s. Estimated total time: 14h 36m 32s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 16s. [2026-03-25 16:48:10,389][__main__][INFO] - Starting iteration 163. [2026-03-25 16:48:10,394][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:48:10,395][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:48:16,926][__main__][INFO] - Number of regex retries in iteration 163: 0 [2026-03-25 16:48:16,927][__main__][INFO] - agents played in iteration 163 are Alice, Bob [2026-03-25 16:48:17,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:48:17,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:48:17,499][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:48:17,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:48:18,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:48:18,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:48:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:48:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:48:20,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:48:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:48:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:48:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:48:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:48:24,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:48:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:48:25,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:48:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:48:26,879][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:48:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:48:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:48:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:48:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:48:30,179][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:48:30,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:48:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:48:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:48:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:48:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:48:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:48:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:48:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:48:36,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:48:36,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:48:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:48:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:48:38,757][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:48:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:48:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:48:40,736][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:48:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:48:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:48:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:48:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:48:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:48:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:48:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:48:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:48:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:48:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:48:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:48:48,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:48:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:48:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:48:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:48:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:48:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:48:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:48:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:54,275][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:56,255][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:56,916][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:59,557][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:00,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:49:01,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:49:02,791][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:02,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:02,795][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:04,312][__main__][INFO] - Iteration 164 took 53s (12.11% Gen, 85.07% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 31m 0s. Estimated total time: 14h 58m 40s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 52s, 500 more iterations: 7h 29m 20s. [2026-03-25 16:49:04,315][__main__][INFO] - Starting iteration 164. [2026-03-25 16:49:04,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:49:04,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:09,188][__main__][INFO] - Number of regex retries in iteration 164: 0 [2026-03-25 16:49:09,189][__main__][INFO] - agents played in iteration 164 are Alice, Bob [2026-03-25 16:49:09,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:09,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:09,828][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:49:09,828][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:49:10,601][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:12,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:14,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:15,855][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:49:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:49:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:49:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:49:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:49:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:49:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:49:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:49:22,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:49:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:49:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:49:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:49:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:49:25,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:49:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:49:27,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:49:27,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:49:28,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:49:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:49:29,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:49:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:49:31,031][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:49:31,691][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:49:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:49:33,009][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:49:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:49:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:49:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:49:35,646][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:49:36,307][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:49:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:49:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:49:38,287][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:49:38,946][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:49:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:49:40,269][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:49:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:49:41,587][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:49:42,583][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:49:43,242][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:49:43,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:49:44,563][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:49:45,224][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:49:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:49:46,545][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:49:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:49:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:49:48,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:49:49,187][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:49:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:49:50,508][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:49:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:49:51,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:52,491][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:53,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:49:53,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:49:55,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:55,011][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:55,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:56,487][__main__][INFO] - Iteration 165 took 52s (9.33% Gen, 87.84% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 0m 57s. Estimated total time: 14h 29m 29s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 44s. [2026-03-25 16:49:56,489][__main__][INFO] - Starting iteration 165. [2026-03-25 16:49:56,493][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:49:56,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:03,019][__main__][INFO] - Number of regex retries in iteration 165: 0 [2026-03-25 16:50:03,020][__main__][INFO] - agents played in iteration 165 are Alice, Bob [2026-03-25 16:50:03,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:03,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:03,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:50:03,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:04,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:08,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:50:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:12,214][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:16,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:17,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:18,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:50:22,771][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:50:23,430][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:50:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:50:24,750][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:50:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:50:26,069][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:50:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:50:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:50:28,056][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:50:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:50:29,374][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:50:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:50:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:50:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:50:32,015][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:50:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:50:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:50:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:50:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:50:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:50:36,310][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:50:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:50:37,630][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:50:38,289][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:50:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:50:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:50:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:50:40,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:50:41,593][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:50:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:50:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:50:43,576][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:50:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:50:44,898][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:50:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:50:46,219][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:50:46,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:50:47,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:50:48,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:50:48,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:50:48,801][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:50:50,409][__main__][INFO] - Iteration 166 took 53s (12.11% Gen, 84.91% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 29m 10s. Estimated total time: 14h 58m 37s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 51s, 500 more iterations: 7h 29m 18s. [2026-03-25 16:50:50,412][__main__][INFO] - Starting iteration 166. [2026-03-25 16:50:50,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:50:50,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:55,266][__main__][INFO] - Number of regex retries in iteration 166: 0 [2026-03-25 16:50:55,269][__main__][INFO] - agents played in iteration 166 are Alice, Bob [2026-03-25 16:50:55,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:55,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:55,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:50:55,906][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:56,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:57,300][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:58,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:02,586][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:51:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:05,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:06,546][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:09,188][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:51:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:51:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:51:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:51:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:51:13,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:51:14,467][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:15,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:16,450][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:17,110][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:51:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:18,430][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:21,731][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:23,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:51:25,696][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:51:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:51:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:51:27,677][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:51:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:51:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:51:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:51:30,673][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:51:31,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:51:31,992][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:51:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:51:33,317][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:51:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:51:34,637][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:51:35,298][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:51:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:51:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:51:37,279][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:51:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:51:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:51:39,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:51:40,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:51:41,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:51:41,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:51:41,294][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:51:42,753][__main__][INFO] - Iteration 167 took 52s (9.27% Gen, 87.94% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 2m 0s. Estimated total time: 14h 32m 18s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 13s, 500 more iterations: 7h 16m 9s. [2026-03-25 16:51:42,756][__main__][INFO] - Starting iteration 167. [2026-03-25 16:51:42,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:51:42,761][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:51:47,791][__main__][INFO] - Number of regex retries in iteration 167: 0 [2026-03-25 16:51:47,792][__main__][INFO] - agents played in iteration 167 are Alice, Bob [2026-03-25 16:51:48,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:51:48,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:51:48,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:51:48,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:51:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:51:49,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:51:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:51:51,008][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:51:51,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:51:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:52,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:51:56,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:58,281][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:04,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:06,875][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:52:08,194][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:52:08,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:52:09,515][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:52:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:52:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:52:11,495][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:52:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:52:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:52:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:52:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:52:14,796][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:52:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:52:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:52:16,779][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:52:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:18,100][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:21,093][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:23,074][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:52:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:25,718][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:27,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:52:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:52:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:52:29,685][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:52:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:52:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:52:31,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:52:32,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:52:33,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:52:33,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:52:33,662][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:52:35,142][__main__][INFO] - Iteration 168 took 52s (9.60% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 1m 53s. Estimated total time: 14h 33m 4s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 32s. [2026-03-25 16:52:35,145][__main__][INFO] - Starting iteration 168. [2026-03-25 16:52:35,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:52:35,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:52:40,029][__main__][INFO] - Number of regex retries in iteration 168: 0 [2026-03-25 16:52:40,030][__main__][INFO] - agents played in iteration 168 are Alice, Bob [2026-03-25 16:52:40,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:52:40,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:52:40,672][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:52:40,673][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:52:41,551][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:52:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:52:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:52:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:52:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:52:44,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:52:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:52:46,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:52:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:52:47,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:52:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:52:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:52:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:52:50,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:52:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:52:51,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:52:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:53,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:54,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:55,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:57,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:58,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:01,301][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:01,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:53:02,620][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:03,281][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:03,940][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:05,261][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:05,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:53:06,584][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:53:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:53:07,910][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:53:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:53:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:53:09,892][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:53:10,553][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:53:11,214][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:53:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:53:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:53:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:53:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:53:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:53:15,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:53:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:53:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:53:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:53:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:53:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:53:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:53:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:20,818][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:22,139][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:24,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:53:24,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:53:26,039][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:26,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:26,044][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:27,441][__main__][INFO] - Iteration 169 took 52s (9.33% Gen, 87.99% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 59m 30s. Estimated total time: 14h 31m 33s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 46s. [2026-03-25 16:53:27,443][__main__][INFO] - Starting iteration 169. [2026-03-25 16:53:27,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:53:27,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:53:33,714][__main__][INFO] - Number of regex retries in iteration 169: 0 [2026-03-25 16:53:33,715][__main__][INFO] - agents played in iteration 169 are Alice, Bob [2026-03-25 16:53:34,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:53:34,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:53:34,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:53:34,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:53:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:53:35,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:53:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:53:37,046][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:53:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:53:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:53:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:53:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:53:40,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:53:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:53:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:53:44,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:53:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:53:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:53:46,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:53:46,956][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:53:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:53:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:53:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:53:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:53:50,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:53:50,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:53:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:53:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:53:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:55,536][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:53:56,195][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:56,856][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:58,838][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:59,498][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:00,158][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:02,796][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:54:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:54:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:54:06,093][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:54:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:54:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:54:08,402][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:54:09,061][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:54:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:54:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:54:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:54:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:54:12,362][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:54:13,023][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:54:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:54:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:54:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:54:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:54:16,325][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:54:16,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:54:17,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:54:18,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:54:19,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:54:19,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:54:19,774][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:54:21,193][__main__][INFO] - Iteration 170 took 53s (11.66% Gen, 85.69% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 22m 51s. Estimated total time: 14h 55m 48s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 54s. [2026-03-25 16:54:21,196][__main__][INFO] - Starting iteration 170. [2026-03-25 16:54:21,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:54:21,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:29,280][__main__][INFO] - Number of regex retries in iteration 170: 0 [2026-03-25 16:54:29,281][__main__][INFO] - agents played in iteration 170 are Alice, Bob [2026-03-25 16:54:29,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:54:29,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:54:29,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:54:29,981][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:54:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:31,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:32,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:32,732][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:34,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:35,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:54:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:54:36,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:54:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:54:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:54:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:54:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:54:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:54:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:54:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:54:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:54:42,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:54:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:54:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:54:44,616][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:54:45,276][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:54:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:54:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:54:47,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:54:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:54:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:54:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:54:49,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:54:50,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:54:51,211][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:54:51,871][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:53,192][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:53,853][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:57,154][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:04,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:55:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:55:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:55:06,728][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:55:07,388][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:55:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:55:08,706][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:55:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:55:10,026][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:55:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:55:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:55:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:55:12,666][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:55:13,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:55:14,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:55:15,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:55:15,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:55:15,295][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:55:16,791][__main__][INFO] - Iteration 171 took 55s (14.54% Gen, 82.77% Train). Generation: 8s, Training: 46s. Estimated remaining time: 12h 52m 40s. Estimated total time: 15h 26m 33s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 16s. [2026-03-25 16:55:16,794][__main__][INFO] - Starting iteration 171. [2026-03-25 16:55:16,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:55:16,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:55:23,783][__main__][INFO] - Number of regex retries in iteration 171: 0 [2026-03-25 16:55:23,784][__main__][INFO] - agents played in iteration 171 are Alice, Bob [2026-03-25 16:55:24,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:55:24,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:55:24,341][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:55:24,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:55:25,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:55:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:55:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:55:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:55:27,691][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:55:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:55:29,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:55:29,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:30,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:55:32,316][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:33,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:34,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:37,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:39,570][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:55:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:55:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:55:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:55:42,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:55:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:55:44,188][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:55:44,849][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:55:45,506][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:55:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:55:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:55:47,486][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:55:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:55:48,806][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:55:49,466][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:55:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:55:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:55:51,447][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:55:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:55:52,767][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:55:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:54,085][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:58,379][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:55:59,698][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:03,000][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:04,977][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:06,957][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:56:07,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:56:08,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:56:09,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:56:09,668][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:56:09,670][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:56:11,090][__main__][INFO] - Iteration 172 took 54s (12.87% Gen, 84.51% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 30m 7s. Estimated total time: 15h 4m 54s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 29s, 500 more iterations: 7h 32m 27s. [2026-03-25 16:56:11,093][__main__][INFO] - Starting iteration 172. [2026-03-25 16:56:11,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:56:11,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:56:16,222][__main__][INFO] - Number of regex retries in iteration 172: 0 [2026-03-25 16:56:16,223][__main__][INFO] - agents played in iteration 172 are Alice, Bob [2026-03-25 16:56:16,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:56:16,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:56:16,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:56:16,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:56:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:56:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:56:18,914][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:56:19,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:56:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:56:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:56:21,554][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:56:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:56:22,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:56:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:56:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:56:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:56:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:56:26,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:56:26,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:56:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:56:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:56:28,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:56:29,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:30,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:33,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:34,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:35,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:36,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:36,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:38,045][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:56:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:42,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:42,660][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:56:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:56:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:56:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:56:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:56:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:56:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:48,599][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:50,249][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:56:50,910][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:56:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:56:52,230][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:54,870][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:56,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:00,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:57:00,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:57:02,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:02,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:02,119][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:03,593][__main__][INFO] - Iteration 173 took 52s (9.76% Gen, 87.42% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 59m 18s. Estimated total time: 14h 34m 58s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 29s. [2026-03-25 16:57:03,596][__main__][INFO] - Starting iteration 173. [2026-03-25 16:57:03,600][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:57:03,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:57:09,387][__main__][INFO] - Number of regex retries in iteration 173: 0 [2026-03-25 16:57:09,388][__main__][INFO] - agents played in iteration 173 are Alice, Bob [2026-03-25 16:57:09,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:57:10,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:57:10,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:57:10,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:57:10,880][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:57:11,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:57:12,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:57:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:57:13,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:57:14,143][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:57:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:57:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:57:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:57:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:57:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:57:18,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:57:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:57:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:57:20,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:57:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:57:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:57:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:57:22,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:57:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:57:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:57:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:57:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:57:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:57:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:28,003][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:57:29,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:57:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:57:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:57:31,304][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:57:31,964][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:57:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:57:33,285][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:57:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:57:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:57:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:57:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:57:36,585][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:38,564][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:41,863][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:42,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:43,523][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:44,844][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:57:45,504][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:46,824][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:57:47,483][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:57:48,144][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:57:48,805][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:57:49,464][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:57:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:57:50,785][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:57:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:57:52,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:57:52,764][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:53,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:57:54,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:57:55,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:55,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:55,550][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:58,140][__main__][INFO] - Iteration 174 took 54s (10.61% Gen, 84.64% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 32m 27s. Estimated total time: 15h 9m 1s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 54s, 500 more iterations: 7h 34m 30s. [2026-03-25 16:57:58,143][__main__][INFO] - Starting iteration 174. [2026-03-25 16:57:58,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:57:58,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:02,981][__main__][INFO] - Number of regex retries in iteration 174: 0 [2026-03-25 16:58:02,983][__main__][INFO] - agents played in iteration 174 are Alice, Bob [2026-03-25 16:58:03,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:03,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:03,588][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:58:03,588][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:58:04,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:05,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:58:07,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:58:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:58:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:58:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:58:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:58:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:58:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:58:12,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:58:12,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:58:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:58:14,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:58:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:58:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:58:16,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:58:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:58:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:58:18,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:58:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:58:19,465][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:58:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:58:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:58:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:58:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:58:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:58:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:58:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:58:24,745][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:58:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:58:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:58:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:58:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:58:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:58:28,712][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:58:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:58:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:58:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:58:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:58:32,015][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:58:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:58:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:58:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:58:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:58:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:58:36,312][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:58:36,973][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:58:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:58:38,293][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:58:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:58:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:58:40,272][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:41,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:42,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:46,206][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:46,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:58:47,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:58:48,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:48,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:48,858][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:58:50,275][__main__][INFO] - Iteration 175 took 52s (9.28% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 51m 24s. Estimated total time: 14h 28m 50s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 25s. [2026-03-25 16:58:50,278][__main__][INFO] - Starting iteration 175. [2026-03-25 16:58:50,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:58:50,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:55,352][__main__][INFO] - Number of regex retries in iteration 175: 0 [2026-03-25 16:58:55,353][__main__][INFO] - agents played in iteration 175 are Alice, Bob [2026-03-25 16:58:55,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:56,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:56,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:58:56,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:58:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:57,382][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:58,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:00,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:01,344][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:03,323][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:59:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:06,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:59:07,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:59:08,610][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:59:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:59:09,930][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:59:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:59:11,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:59:11,912][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:59:12,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:59:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:59:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:59:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:15,213][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:16,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:17,196][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:59:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:59:18,516][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:59:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:59:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:59:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:59:21,155][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:59:21,815][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:59:22,477][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:59:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:59:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:59:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:59:25,118][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:59:25,778][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:27,758][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:29,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:30,073][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:59:31,392][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:59:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:59:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:59:33,374][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:59:34,034][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:59:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:59:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:59:36,013][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:59:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:59:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:59:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:59:38,660][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:59:39,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:59:40,135][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:59:41,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:59:41,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:59:41,321][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:42,671][__main__][INFO] - Iteration 176 took 52s (9.68% Gen, 87.74% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 54m 52s. Estimated total time: 14h 33m 11s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 35s. [2026-03-25 16:59:42,674][__main__][INFO] - Starting iteration 176. [2026-03-25 16:59:42,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:59:42,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:47,559][__main__][INFO] - Number of regex retries in iteration 176: 0 [2026-03-25 16:59:47,560][__main__][INFO] - agents played in iteration 176 are Alice, Bob [2026-03-25 16:59:48,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:59:48,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:59:48,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:59:48,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:59:48,827][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:59:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:59:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:59:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:52,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:52,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:54,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:55,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:59:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:56,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:58,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:00,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:01,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:03,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:05,951][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:00:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:00:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:00:09,251][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:00:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:00:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:00:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:00:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:00:15,191][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:00:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:00:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:00:17,171][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:00:17,831][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:00:18,490][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:00:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:00:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:00:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:00:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:00:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:00:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:00:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:27,408][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:00:29,392][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:00:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:00:30,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:00:31,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:00:32,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:00:33,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:00:33,409][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:00:33,410][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:00:34,681][__main__][INFO] - Iteration 177 took 52s (9.38% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 47m 34s. Estimated total time: 14h 26m 44s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 22s. [2026-03-25 17:00:34,684][__main__][INFO] - Starting iteration 177. [2026-03-25 17:00:34,688][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:00:34,688][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:00:39,541][__main__][INFO] - Number of regex retries in iteration 177: 0 [2026-03-25 17:00:39,542][__main__][INFO] - agents played in iteration 177 are Alice, Bob [2026-03-25 17:00:40,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:00:40,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:00:40,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:00:40,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:00:40,974][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:00:41,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:42,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:44,904][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:45,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:46,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:46,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:00:48,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:49,521][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:51,504][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:54,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:55,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:58,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:01,402][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:01:02,060][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:02,720][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:03,380][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:01:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:01:07,338][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:01:07,996][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:01:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:01:09,316][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:01:09,977][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:01:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:01:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:01:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:01:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:01:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:01:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:01:14,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:01:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:01:16,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:01:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:01:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:01:18,229][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:01:18,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:01:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:01:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:01:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:21,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:23,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:01:24,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:01:25,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:25,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:25,555][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:27,830][__main__][INFO] - Iteration 178 took 53s (9.13% Gen, 86.58% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 5m 40s. Estimated total time: 14h 45m 44s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 52s. [2026-03-25 17:01:27,832][__main__][INFO] - Starting iteration 178. [2026-03-25 17:01:27,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:01:27,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:01:33,430][__main__][INFO] - Number of regex retries in iteration 178: 0 [2026-03-25 17:01:33,431][__main__][INFO] - agents played in iteration 178 are Alice, Bob [2026-03-25 17:01:33,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:01:34,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:01:34,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:01:34,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:01:34,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:01:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:01:36,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:01:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:01:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:01:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:01:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:01:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:01:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:01:40,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:01:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:01:41,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:01:42,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:01:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:01:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:01:44,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:01:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:01:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:46,570][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:50,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:53,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:53,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:54,495][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:55,154][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:01:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:57,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:58,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:00,436][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:01,755][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:03,735][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:04,397][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:07,371][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:02:08,031][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:02:08,694][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:02:09,357][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:02:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:02:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:02:11,338][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:02:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:02:12,663][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:02:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:02:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:02:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:02:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:02:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:02:16,627][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:02:17,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:02:18,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:02:19,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:02:19,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:02:19,390][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:02:20,826][__main__][INFO] - Iteration 179 took 52s (10.56% Gen, 86.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 2m 15s. Estimated total time: 14h 43m 11s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 19s, 500 more iterations: 7h 21m 35s. [2026-03-25 17:02:20,829][__main__][INFO] - Starting iteration 179. [2026-03-25 17:02:20,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:02:20,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:02:25,541][__main__][INFO] - Number of regex retries in iteration 179: 0 [2026-03-25 17:02:25,542][__main__][INFO] - agents played in iteration 179 are Alice, Bob [2026-03-25 17:02:26,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:02:26,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:02:26,168][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:02:26,168][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:02:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:02:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:02:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:02:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:02:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:02:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:02:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:02:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:02:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:02:32,850][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:02:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:02:34,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:02:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:02:35,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:02:36,148][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:02:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:02:37,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:02:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:02:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:02:39,445][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:02:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:02:40,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:02:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:02:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:02:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:02:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:02:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:02:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:02:45,383][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:02:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:02:46,703][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:02:47,363][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:02:48,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:02:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:02:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:50,665][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:52,645][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:55,287][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:55,947][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:58,926][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:00,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:03:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:02,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:02,889][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:04,868][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:06,188][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:06,850][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:07,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:03:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:03:08,829][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:03:09,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:03:10,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:03:11,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:03:11,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:03:11,483][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:03:12,822][__main__][INFO] - Iteration 180 took 51s (9.06% Gen, 88.36% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 44m 41s. Estimated total time: 14h 26m 30s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 15s. [2026-03-25 17:03:12,824][__main__][INFO] - Starting iteration 180. [2026-03-25 17:03:12,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:03:12,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:17,649][__main__][INFO] - Number of regex retries in iteration 180: 0 [2026-03-25 17:03:17,650][__main__][INFO] - agents played in iteration 180 are Alice, Bob [2026-03-25 17:03:18,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:18,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:18,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:03:18,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:03:19,137][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:19,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:22,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:23,053][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:03:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:03:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:03:25,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:03:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:03:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:03:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:03:28,333][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:03:28,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:03:29,652][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:03:30,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:03:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:03:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:03:32,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:03:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:03:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:03:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:03:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:03:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:03:36,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:03:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:03:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:03:38,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:03:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:03:39,550][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:03:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:03:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:03:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:03:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:03:42,847][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:03:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:03:44,166][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:03:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:03:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:03:46,143][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:03:46,803][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:48,121][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:48,779][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:50,103][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:03:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:03:51,760][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:03:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:54,399][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:55,720][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:56,380][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:59,020][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:59,681][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:01,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:04:02,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:04:03,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:03,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:03,647][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:05,162][__main__][INFO] - Iteration 181 took 52s (9.21% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 49m 34s. Estimated total time: 14h 32m 15s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 13s, 500 more iterations: 7h 16m 7s. [2026-03-25 17:04:05,165][__main__][INFO] - Starting iteration 181. [2026-03-25 17:04:05,169][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:04:05,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:04:10,138][__main__][INFO] - Number of regex retries in iteration 181: 0 [2026-03-25 17:04:10,140][__main__][INFO] - agents played in iteration 181 are Alice, Bob [2026-03-25 17:04:10,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:04:10,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:04:10,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:04:10,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:04:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:04:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:04:12,711][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:04:13,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:04:14,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:04:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:04:15,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:04:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:16,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:04:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:04:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:04:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:04:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:04:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:04:25,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:04:26,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:04:27,218][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:04:27,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:04:28,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:04:29,198][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:04:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:04:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:04:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:04:31,841][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:04:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:04:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:04:33,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:04:34,483][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:04:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:04:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:04:36,462][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:04:37,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:04:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:04:38,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:04:39,101][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:04:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:04:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:04:41,078][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:04:41,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:04:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:44,050][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:44,710][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:45,369][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:04:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:48,005][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:49,984][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:51,304][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:51,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:52,620][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:53,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:53,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:04:54,870][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:04:56,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:56,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:56,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:57,431][__main__][INFO] - Iteration 182 took 52s (9.51% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 47m 30s. Estimated total time: 14h 31m 3s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 31s. [2026-03-25 17:04:57,434][__main__][INFO] - Starting iteration 182. [2026-03-25 17:04:57,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:04:57,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:02,641][__main__][INFO] - Number of regex retries in iteration 182: 0 [2026-03-25 17:05:02,642][__main__][INFO] - agents played in iteration 182 are Alice, Bob [2026-03-25 17:05:03,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:03,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:03,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:05:03,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:05:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:05,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:07,976][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:05:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:05:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:05:09,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:05:10,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:05:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:05:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:05:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:05:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:05:13,914][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:05:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:05:15,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:16,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:19,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:20,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:21,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:05:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:05:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:05:23,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:05:24,476][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:05:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:05:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:05:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:05:27,114][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:05:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:05:28,433][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:05:29,092][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:05:29,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:05:30,414][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:05:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:05:31,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:05:32,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:05:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:05:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:05:34,369][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:05:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:05:36,024][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:05:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:05:37,345][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:05:38,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:05:38,663][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:05:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:41,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:41,965][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:05:43,946][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:05:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:05:45,264][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:05:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:05:46,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:05:47,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:05:48,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:48,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:48,483][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:49,810][__main__][INFO] - Iteration 183 took 52s (9.93% Gen, 87.53% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 48m 28s. Estimated total time: 14h 32m 54s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 27s. [2026-03-25 17:05:49,818][__main__][INFO] - Starting iteration 183. [2026-03-25 17:05:49,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:05:49,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:54,700][__main__][INFO] - Number of regex retries in iteration 183: 0 [2026-03-25 17:05:54,701][__main__][INFO] - agents played in iteration 183 are Alice, Bob [2026-03-25 17:05:55,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:55,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:55,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:05:55,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:05:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:59,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:00,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:02,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:06:03,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:03,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:04,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:05,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:06:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:06:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:06:07,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:06:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:06:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:06:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:06:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:06:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:06:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:06:11,861][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:06:12,521][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:06:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:06:13,838][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:16,476][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:06:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:17,794][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:19,770][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:06:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:06:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:06:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:06:22,407][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:06:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:06:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:06:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:06:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:06:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:06:26,369][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:06:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:06:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:06:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:06:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:06:30,000][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:06:30,657][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:06:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:06:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:06:32,635][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:06:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:06:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:06:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:06:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:06:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:06:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:06:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:06:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:06:38,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:06:39,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:06:40,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:06:40,571][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:06:42,828][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:06:44,126][__main__][INFO] - Iteration 184 took 54s (8.98% Gen, 88.62% Train). Generation: 4s, Training: 48s. Estimated remaining time: 12h 19m 45s. Estimated total time: 15h 5m 5s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 32s. [2026-03-25 17:06:44,128][__main__][INFO] - Starting iteration 184. [2026-03-25 17:06:44,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:06:44,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:06:49,022][__main__][INFO] - Number of regex retries in iteration 184: 0 [2026-03-25 17:06:49,023][__main__][INFO] - agents played in iteration 184 are Alice, Bob [2026-03-25 17:06:49,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:06:49,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:06:49,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:06:49,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:06:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:50,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:51,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:52,215][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:54,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:54,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:56,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:56,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:06:57,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:58,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:58,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:00,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:00,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:01,447][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:02,106][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:02,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:04,083][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:06,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:08,035][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:09,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:10,672][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:07:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:07:11,991][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:07:12,650][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:07:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:07:13,970][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:17,265][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:17,924][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:19,242][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:19,901][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:07:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:07:21,220][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:07:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:07:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:07:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:07:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:07:24,852][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:07:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:07:26,171][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:07:26,832][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:07:27,490][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:07:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:07:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:07:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:32,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:07:33,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:07:34,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:07:34,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:07:34,721][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:07:36,150][__main__][INFO] - Iteration 185 took 52s (9.40% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 40m 46s. Estimated total time: 14h 26m 58s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 29s. [2026-03-25 17:07:36,153][__main__][INFO] - Starting iteration 185. [2026-03-25 17:07:36,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:07:36,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:07:41,050][__main__][INFO] - Number of regex retries in iteration 185: 0 [2026-03-25 17:07:41,051][__main__][INFO] - agents played in iteration 185 are Alice, Bob [2026-03-25 17:07:41,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:07:41,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:07:41,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:07:41,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:07:42,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:07:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:07:43,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:07:44,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:07:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:07:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:07:46,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:07:47,022][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:07:47,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:07:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:07:49,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:07:49,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:07:50,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:07:50,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:07:51,643][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:53,625][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:55,603][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:56,920][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:57,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:59,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:00,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:01,537][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:02,855][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:08:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:04,173][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:04,832][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:05,491][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:07,469][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:08,128][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:08,787][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:10,105][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:11,424][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:12,084][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:15,046][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:15,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:08:16,366][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:08:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:08:17,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:08:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:08:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:08:19,662][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:08:20,323][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:08:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:08:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:08:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:08:22,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:08:23,618][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:08:24,279][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:08:24,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:08:25,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:08:27,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:27,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:27,009][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:08:28,578][__main__][INFO] - Iteration 186 took 52s (9.33% Gen, 87.67% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 46m 38s. Estimated total time: 14h 33m 42s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 51s. [2026-03-25 17:08:28,580][__main__][INFO] - Starting iteration 186. [2026-03-25 17:08:28,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:08:28,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:08:33,384][__main__][INFO] - Number of regex retries in iteration 186: 0 [2026-03-25 17:08:33,387][__main__][INFO] - agents played in iteration 186 are Alice, Bob [2026-03-25 17:08:33,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:08:33,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:08:33,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:08:33,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:08:34,622][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:08:35,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:08:35,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:08:36,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:08:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:08:37,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:08:38,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:08:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:08:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:08:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:08:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:08:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:08:42,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:08:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:08:43,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:08:44,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:08:45,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:08:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:08:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:08:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:08:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:08:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:08:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:08:49,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:08:50,413][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:08:51,073][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:08:51,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:55,036][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:08:55,697][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:56,355][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:57,014][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:00,975][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:01,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:02,296][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:04,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:05,594][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:08,558][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:09:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:09,878][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:10,538][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:11,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:11,856][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:13,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:15,154][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:15,814][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:17,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:09:17,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:09:19,119][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:09:19,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:09:19,123][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:20,453][__main__][INFO] - Iteration 187 took 51s (9.26% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 36m 33s. Estimated total time: 14h 24m 30s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 15s. [2026-03-25 17:09:20,455][__main__][INFO] - Starting iteration 187. [2026-03-25 17:09:20,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:09:20,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:09:25,191][__main__][INFO] - Number of regex retries in iteration 187: 0 [2026-03-25 17:09:25,192][__main__][INFO] - agents played in iteration 187 are Alice, Bob [2026-03-25 17:09:25,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:09:25,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:09:25,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:09:25,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:09:26,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:09:27,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:09:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:09:28,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:09:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:09:29,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:09:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:09:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:09:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:09:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:09:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:09:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:09:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:09:35,124][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:09:35,784][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:09:36,443][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:09:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:09:37,762][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:09:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:09:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:09:39,739][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:09:40,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:09:41,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:09:41,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:09:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:09:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:09:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:09:44,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:09:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:09:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:09:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:09:46,998][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:09:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:09:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:09:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:09:49,637][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:09:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:09:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:09:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:52,276][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:52,936][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:56,895][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:00,536][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:10:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:02,513][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:04,493][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:06,471][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:07,131][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:07,790][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:08,449][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:09,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:10:09,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:10:10,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:10,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:10,967][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:12,435][__main__][INFO] - Iteration 188 took 51s (9.11% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 37m 29s. Estimated total time: 14h 26m 17s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 8s. [2026-03-25 17:10:12,437][__main__][INFO] - Starting iteration 188. [2026-03-25 17:10:12,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:10:12,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:17,246][__main__][INFO] - Number of regex retries in iteration 188: 0 [2026-03-25 17:10:17,247][__main__][INFO] - agents played in iteration 188 are Alice, Bob [2026-03-25 17:10:17,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:10:17,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:10:17,782][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:10:17,782][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:10:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:10:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:10:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:10:21,172][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:10:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:10:22,489][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:10:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:10:23,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:10:24,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:10:25,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:10:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:10:26,446][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:10:27,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:10:27,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:10:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:10:29,086][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:10:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:10:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:10:31,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:10:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:10:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:10:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:10:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:10:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:10:35,023][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:10:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:10:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:10:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:10:37,662][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:10:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:10:38,979][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:10:39,639][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:10:40,298][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:10:40,956][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:10:41,615][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:10:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:10:42,933][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:10:43,592][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:10:44,252][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:10:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:10:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:10:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:10:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:10:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:10:48,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:10:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:10:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:10:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:10:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:10:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:52,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:10:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:54,481][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:57,122][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:59,764][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:01,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:11:01,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:11:03,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:03,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:03,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:04,643][__main__][INFO] - Iteration 189 took 52s (9.20% Gen, 87.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 40m 23s. Estimated total time: 14h 30m 3s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 1s. [2026-03-25 17:11:04,645][__main__][INFO] - Starting iteration 189. [2026-03-25 17:11:04,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:11:04,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:08,090][mllm.models.large_language_model_local][WARNING] - Response |), retry 1/1 [2026-03-25 17:11:09,494][__main__][INFO] - Number of regex retries in iteration 189: 1 [2026-03-25 17:11:09,495][__main__][INFO] - agents played in iteration 189 are Alice, Bob [2026-03-25 17:11:10,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:11:10,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:11:10,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:11:10,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 17:11:10,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:13,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:16,779][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:17,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:11:18,099][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:11:20,079][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:11:20,740][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:11:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:11:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:11:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:11:23,378][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:11:24,039][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 17:11:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:11:25,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:11:26,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:11:26,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:11:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:11:27,997][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:11:28,657][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:11:29,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:11:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:11:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:11:31,295][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:11:31,954][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:11:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:11:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:11:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:11:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:11:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:11:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:11:36,572][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:11:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
17:11:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:11:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:11:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:11:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:11:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:11:41,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:11:41,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:11:42,842][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:11:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:11:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:11:44,820][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:11:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:11:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:11:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:11:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:11:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:11:48,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:11:49,438][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:11:50,097][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:11:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:11:51,415][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 17:11:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:52,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:53,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:11:54,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:11:55,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:55,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:55,451][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:56,639][__main__][INFO] - Iteration 190 took 51s (9.32% Gen, 88.39% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 35m 59s. Estimated total time: 14h 26m 31s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 15s. [2026-03-25 17:11:56,642][__main__][INFO] - Starting iteration 190. [2026-03-25 17:11:56,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:11:56,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:01,545][__main__][INFO] - Number of regex retries in iteration 190: 0 [2026-03-25 17:12:01,547][__main__][INFO] - agents played in iteration 190 are Alice, Bob [2026-03-25 17:12:02,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:02,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:02,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:12:02,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:12:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:03,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:04,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:04,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:05,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:06,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:12:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:12:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:10,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:12,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:12,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:13,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:14,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:14,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:16,108][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:16,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:18,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:18,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:22,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:12:23,356][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:12:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:12:24,675][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:12:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:12:25,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:12:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:12:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:12:27,973][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:12:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:12:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:12:29,951][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:12:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:12:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:12:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:12:32,587][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:12:33,247][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:12:33,906][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:12:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:12:35,561][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:12:36,221][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:12:36,880][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:12:37,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:12:38,200][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:12:38,860][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:12:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:12:40,180][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:12:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:12:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:12:42,157][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:12:42,817][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:12:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:12:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:12:44,795][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:12:45,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:12:46,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:12:47,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:12:47,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:12:47,440][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:12:48,676][__main__][INFO] - Iteration 191 took 52s (9.42% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 35m 47s. Estimated total time: 14h 27m 12s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 36s. [2026-03-25 17:12:48,679][__main__][INFO] - Starting iteration 191. [2026-03-25 17:12:48,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:12:48,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:53,470][__main__][INFO] - Number of regex retries in iteration 191: 0 [2026-03-25 17:12:53,471][__main__][INFO] - agents played in iteration 191 are Alice, Bob [2026-03-25 17:12:54,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:54,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:54,067][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:12:54,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:12:54,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:56,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:57,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:58,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:00,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:13:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:03,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:04,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:13:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:13:08,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:13:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:13:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:13:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:13:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:13:11,973][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:13:12,632][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:13:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:13:13,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:13:14,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:15,270][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:13:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:18,569][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:19,888][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:20,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:21,206][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:21,865][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:22,524][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:23,184][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:25,162][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:25,822][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:13:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:13:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:13:28,803][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:13:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:13:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:13:30,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:13:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:13:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:13:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:13:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:13:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:13:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:13:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:13:36,057][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:13:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:13:37,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:13:38,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:13:39,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:13:39,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:13:39,425][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:13:40,852][__main__][INFO] - Iteration 192 took 52s (9.18% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 37m 14s. Estimated total time: 14h 29m 31s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 45s. [2026-03-25 17:13:40,855][__main__][INFO] - Starting iteration 192. [2026-03-25 17:13:40,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:13:40,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:13:45,698][__main__][INFO] - Number of regex retries in iteration 192: 0 [2026-03-25 17:13:45,699][__main__][INFO] - agents played in iteration 192 are Alice, Bob [2026-03-25 17:13:46,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:13:46,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:13:46,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:13:46,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:13:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:13:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:13:48,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:13:49,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:13:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:13:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:13:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:13:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:53,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:13:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:55,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:56,323][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:57,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:00,279][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:01,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:02,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:03,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:04,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:06,871][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:07,531][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:14:08,191][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:14:08,851][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:14:09,511][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:14:10,170][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:14:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:14:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:14:12,147][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:14:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:14:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:14:14,126][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:14:14,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:14:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:14:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:14:16,766][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:14:17,428][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:14:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:14:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:19,754][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:14:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:22,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:23,049][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:26,346][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:27,006][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:28,327][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:29,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:14:30,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:14:31,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:14:31,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:14:31,661][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:14:32,977][__main__][INFO] - Iteration 193 took 52s (9.28% Gen, 88.18% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 35m 31s. Estimated total time: 14h 28m 40s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 20s. [2026-03-25 17:14:32,979][__main__][INFO] - Starting iteration 193. [2026-03-25 17:14:32,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:14:32,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:14:37,755][__main__][INFO] - Number of regex retries in iteration 193: 0 [2026-03-25 17:14:37,756][__main__][INFO] - agents played in iteration 193 are Alice, Bob [2026-03-25 17:14:38,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:14:38,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:14:38,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:14:38,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:14:39,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:14:39,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:14:40,290][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:14:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:14:41,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:14:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:14:42,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:14:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:14:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:14:44,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:14:45,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:14:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:14:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:47,544][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:48,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:49,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:50,179][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:51,499][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:52,158][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:53,476][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:58,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:59,417][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:15:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:01,394][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:03,371][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:04,690][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:15:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:15:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:15:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:15:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:15:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:15:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:15:12,276][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:15:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:15:13,595][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:15:14,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:15:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:15:15,576][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:15:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:15:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:15:17,554][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:15:18,214][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:15:18,873][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:15:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:15:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:15:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:15:21,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:15:22,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:15:23,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:23,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:23,412][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:24,748][__main__][INFO] - Iteration 194 took 51s (9.22% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 28m 46s. Estimated total time: 14h 22m 47s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 23s. [2026-03-25 17:15:24,777][__main__][INFO] - Starting iteration 194. [2026-03-25 17:15:24,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:15:24,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:15:29,913][__main__][INFO] - Number of regex retries in iteration 194: 0 [2026-03-25 17:15:29,914][__main__][INFO] - agents played in iteration 194 are Alice, Bob [2026-03-25 17:15:30,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:15:30,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:15:30,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:15:30,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:15:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:15:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:15:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:15:33,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:15:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:15:34,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:15:35,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:15:35,726][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:15:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:15:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:15:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:15:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:15:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:15:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:15:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:15:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:15:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:15:42,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:15:42,976][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:15:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:15:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:46,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:49,573][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:50,893][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:51,554][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:15:52,214][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:54,853][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:56,831][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:57,492][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:01,453][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:03,811][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:04,473][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:05,133][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:16:05,796][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:07,123][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:16:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:16:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:16:09,107][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:16:09,770][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:16:10,434][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:16:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:16:11,754][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:16:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:16:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:16:13,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:16:14,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:16:15,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:16:15,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:16:15,998][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:16:17,273][__main__][INFO] - Iteration 195 took 52s (9.78% Gen, 87.79% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 40m 0s. Estimated total time: 14h 34m 54s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 27s. [2026-03-25 17:16:17,288][__main__][INFO] - Starting iteration 195. [2026-03-25 17:16:17,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:16:17,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:22,185][__main__][INFO] - Number of regex retries in iteration 195: 0 [2026-03-25 17:16:22,187][__main__][INFO] - agents played in iteration 195 are Alice, Bob [2026-03-25 17:16:22,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:16:22,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:16:22,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:16:22,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:16:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:24,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:26,206][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:26,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:27,528][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:28,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:28,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:16:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:16:31,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:16:32,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:16:32,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:16:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:16:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:16:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:16:35,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:16:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:16:36,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:16:37,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:16:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:16:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:16:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:16:40,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:16:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:16:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:16:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:16:42,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:16:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:16:44,037][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:16:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:47,336][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:16:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:16:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:16:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:16:50,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:16:51,293][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:16:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:52,610][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:56,235][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:16:58,214][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:58,875][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:59,534][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:00,853][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:02,171][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:03,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:04,151][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:04,810][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:05,468][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:06,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:17:06,929][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:17:08,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:17:08,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:17:08,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:17:09,411][__main__][INFO] - Iteration 196 took 52s (9.39% Gen, 88.14% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 32m 54s. Estimated total time: 14h 28m 39s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 19s. [2026-03-25 17:17:09,413][__main__][INFO] - Starting iteration 196. [2026-03-25 17:17:09,417][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:17:09,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:17:14,231][__main__][INFO] - Number of regex retries in iteration 196: 0 [2026-03-25 17:17:14,232][__main__][INFO] - agents played in iteration 196 are Alice, Bob [2026-03-25 17:17:14,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:17:14,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:17:14,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:17:14,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:17:15,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:17:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:17:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:17:17,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:17:18,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:17:18,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:17:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:17:20,148][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:17:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:17:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:17:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:17:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:23,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:24,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:24,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:27,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:30,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:17:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:17:32,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:17:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:17:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:17:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:17:35,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:17:35,980][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:17:36,640][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:37,959][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:38,618][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:39,277][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:40,599][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:43,896][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:44,556][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:17:45,215][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:17:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:17:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:17:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:17:48,190][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:17:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:17:49,508][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:17:50,168][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:17:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:17:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:53,467][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:58,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:17:59,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:18:00,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:00,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:00,423][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:02,069][__main__][INFO] - Iteration 197 took 52s (9.14% Gen, 87.73% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 40m 56s. Estimated total time: 14h 37m 34s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 47s. [2026-03-25 17:18:02,072][__main__][INFO] - Starting iteration 197. [2026-03-25 17:18:02,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:18:02,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:07,101][__main__][INFO] - Number of regex retries in iteration 197: 0 [2026-03-25 17:18:07,102][__main__][INFO] - agents played in iteration 197 are Alice, Bob [2026-03-25 17:18:07,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:07,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:07,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:18:07,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:18:08,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:18:09,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:18:09,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:18:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:18:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:18:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:18:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:18:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:18:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:18:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:18:14,987][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:18:15,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:18:16,307][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:18:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:18:17,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:18:18,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:18:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:18:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:18:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:18:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:18:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:18:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:18:22,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:18:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:18:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:24,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:26,203][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:26,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:28,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:28,842][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:18:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:30,160][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:31,480][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:32,140][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:18:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:38,735][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:39,395][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:18:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:18:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:18:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:18:43,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:18:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:18:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:18:45,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:18:46,327][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:18:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:18:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:18:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:18:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:18:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:18:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:18:50,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:18:51,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:18:53,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:53,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:53,108][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:54,507][__main__][INFO] - Iteration 198 took 52s (9.58% Gen, 87.74% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 36m 22s. Estimated total time: 14h 33m 52s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 56s. [2026-03-25 17:18:54,509][__main__][INFO] - Starting iteration 198. [2026-03-25 17:18:54,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:18:54,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:59,346][__main__][INFO] - Number of regex retries in iteration 198: 0 [2026-03-25 17:18:59,347][__main__][INFO] - agents played in iteration 198 are Alice, Bob [2026-03-25 17:18:59,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:59,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:59,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:18:59,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:19:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:01,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:04,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:04,822][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:05,479][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:19:07,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:19:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:19:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:19:09,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:19:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:19:10,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:19:11,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:19:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:19:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:19:13,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:19:14,048][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:19:14,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:19:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:19:16,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:19:16,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:19:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:19:18,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:19:18,667][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:19:19,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:19:19,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:19:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:19:21,306][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:19:21,964][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:19:22,622][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:19:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:19:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:19:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:19:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:19:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:19:26,579][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:27,239][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:27,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:29,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:34,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:19:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:36,152][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:36,811][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:38,788][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:19:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:19:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:19:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:19:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:19:42,742][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:19:43,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:19:44,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:19:45,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:19:45,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:19:45,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:19:47,058][__main__][INFO] - Iteration 199 took 52s (9.20% Gen, 88.27% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 37m 23s. Estimated total time: 14h 35m 46s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 53s. [2026-03-25 17:19:47,063][__main__][INFO] - Starting iteration 199. [2026-03-25 17:19:47,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:19:47,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:19:51,947][__main__][INFO] - Number of regex retries in iteration 199: 0 [2026-03-25 17:19:51,948][__main__][INFO] - agents played in iteration 199 are Alice, Bob [2026-03-25 17:19:52,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:19:52,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:19:52,606][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:19:52,607][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:19:53,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:54,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:54,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:56,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:57,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:00,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:20:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:01,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:03,310][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:20:07,931][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:20:08,590][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:20:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:20:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:20:10,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:20:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:20:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:20:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:20:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:20:13,868][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:20:14,528][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:20:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:20:15,848][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:20:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:20:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:20:17,825][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:20:18,485][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:20:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:20:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:20:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:20:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:20:21,783][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:20:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:20:23,102][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:20:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:20:24,422][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:20:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:20:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:20:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:20:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:20:28,050][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:20:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:20:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:20:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:32,006][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:35,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:20:36,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:20:37,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:37,999][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:38,000][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:39,231][__main__][INFO] - Iteration 200 took 52s (9.30% Gen, 88.34% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 29m 38s. Estimated total time: 14h 28m 53s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 26s. [2026-03-25 17:20:39,234][__main__][INFO] - Starting iteration 200. [2026-03-25 17:20:39,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:20:39,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:20:44,408][__main__][INFO] - Number of regex retries in iteration 200: 0 [2026-03-25 17:20:44,412][__main__][INFO] - agents played in iteration 200 are Alice, Bob [2026-03-25 17:20:44,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:20:45,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:20:45,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:20:45,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:20:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:20:46,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:20:47,019][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:20:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:20:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:20:49,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:20:49,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:20:50,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:20:50,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:20:51,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:52,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:20:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:54,276][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:55,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:56,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:57,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:58,230][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:58,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:00,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:00,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:01,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:02,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:21:06,143][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:21:06,802][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:21:07,461][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:21:08,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:21:08,782][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:21:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:21:10,101][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:21:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:21:11,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:21:12,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:21:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:21:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:21:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:21:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:21:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:21:16,039][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:21:16,699][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:21:17,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:21:18,350][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:21:19,011][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:21:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:21:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:21:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:21:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:21:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:21:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:21:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:21:24,289][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:21:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:21:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:21:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:26,928][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:21:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:21:28,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:21:29,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:21:30,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:21:30,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:21:30,260][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:32,809][__main__][INFO] - Iteration 201 took 53s (9.66% Gen, 85.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 52m 45s. Estimated total time: 14h 52m 54s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 17s, 500 more iterations: 7h 26m 27s. [2026-03-25 17:21:32,812][__main__][INFO] - Starting iteration 201. [2026-03-25 17:21:32,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:21:32,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:37,629][__main__][INFO] - Number of regex retries in iteration 201: 0 [2026-03-25 17:21:37,630][__main__][INFO] - agents played in iteration 201 are Alice, Bob [2026-03-25 17:21:38,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:21:38,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:21:38,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:21:38,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:21:39,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:40,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:40,881][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:42,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:21:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:21:44,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:21:44,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:21:45,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:21:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:21:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:21:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:21:48,141][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:21:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:21:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:21:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:21:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:21:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:21:52,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:21:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:21:53,421][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:54,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:57,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:58,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:59,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:00,017][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:22:00,678][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:01,338][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:02,002][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:22:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:22:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:22:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:22:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:22:10,581][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:22:11,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:22:12,251][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:22:12,911][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:22:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:22:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:22:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:22:15,552][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:22:16,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:22:16,874][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:22:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:22:18,194][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:22:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:22:19,516][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:22:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:22:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:22,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:22:22,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:22:24,494][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:24,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:24,508][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:22:25,861][__main__][INFO] - Iteration 202 took 53s (9.07% Gen, 88.37% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 43m 4s. Estimated total time: 14h 44m 6s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 3s. [2026-03-25 17:22:25,863][__main__][INFO] - Starting iteration 202. [2026-03-25 17:22:25,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:22:25,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:30,667][__main__][INFO] - Number of regex retries in iteration 202: 0 [2026-03-25 17:22:30,668][__main__][INFO] - agents played in iteration 202 are Alice, Bob [2026-03-25 17:22:31,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:31,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:31,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:22:31,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:22:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:22:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:36,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:38,611][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:22:39,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:39,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:40,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:42,570][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:43,229][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:43,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:22:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:22:47,185][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:22:47,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:22:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:22:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:22:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:22:50,481][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:22:51,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:22:51,798][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:52,457][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:22:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:54,433][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:55,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:55,751][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:56,410][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:57,728][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:59,704][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:00,362][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:04,001][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:23:04,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:23:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:23:05,978][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:23:06,637][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:23:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:23:07,956][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:23:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:23:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:23:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:23:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:23:11,252][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:23:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:23:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:23:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:23:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:23:14,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:23:15,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:23:16,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:23:16,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:23:16,659][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:17,969][__main__][INFO] - Iteration 203 took 52s (9.21% Gen, 88.27% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 26m 29s. Estimated total time: 14h 28m 23s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 11s. [2026-03-25 17:23:17,972][__main__][INFO] - Starting iteration 203. [2026-03-25 17:23:17,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:23:17,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:23:24,762][__main__][INFO] - Number of regex retries in iteration 203: 0 [2026-03-25 17:23:24,763][__main__][INFO] - agents played in iteration 203 are Alice, Bob [2026-03-25 17:23:25,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:23:25,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:23:25,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:23:25,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:23:26,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:23:26,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:23:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:23:28,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:23:28,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:29,425][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:31,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:23:33,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:34,049][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:35,366][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:23:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:23:36,683][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:23:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:23:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:23:38,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:23:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:23:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:45,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:48,363][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:23:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:50,340][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:51,659][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:52,977][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:23:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:23:54,295][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:23:54,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:23:55,613][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:56,930][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:59,903][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:24:02,537][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:03,196][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:03,855][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:05,832][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:06,492][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:24:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:24:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:24:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:24:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:24:10,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:24:11,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 17:24:12,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:24:12,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:24:12,412][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:24:13,855][__main__][INFO] - Iteration 204 took 55s (12.14% Gen, 85.27% Train). Generation: 6s, Training: 47s. Estimated remaining time: 12h 28m 31s. Estimated total time: 15h 31m 21s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 40s. [2026-03-25 17:24:13,858][__main__][INFO] - Starting iteration 204. [2026-03-25 17:24:13,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:24:13,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:24:19,282][__main__][INFO] - Number of regex retries in iteration 204: 0 [2026-03-25 17:24:19,283][__main__][INFO] - agents played in iteration 204 are Alice, Bob [2026-03-25 17:24:19,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:24:19,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:24:19,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:24:19,919][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:24:20,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:24:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:24:21,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:24:22,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:24:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:24:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:24:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:24:25,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:24:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:24:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:24:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:24:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:24:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:24:29,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:24:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:32,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:35,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:35,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:36,447][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:37,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:24:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:24:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:24:39,749][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:24:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:24:41,067][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:24:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:24:42,388][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:24:43,048][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:24:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:24:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:24:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:24:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:46,347][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:47,007][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:47,666][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:48,985][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:50,961][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:51,620][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:52,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:24:55,262][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:57,901][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:58,560][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:59,219][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:59,880][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:00,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:01,203][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:03,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:25:04,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:25:05,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:05,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:05,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:06,903][__main__][INFO] - Iteration 205 took 53s (10.22% Gen, 86.98% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 40m 20s. Estimated total time: 14h 44m 3s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 1s. [2026-03-25 17:25:06,905][__main__][INFO] - Starting iteration 205. [2026-03-25 17:25:06,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:25:06,910][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:12,139][__main__][INFO] - Number of regex retries in iteration 205: 0 [2026-03-25 17:25:12,140][__main__][INFO] - agents played in iteration 205 are Alice, Bob [2026-03-25 17:25:12,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:25:12,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:25:12,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:25:12,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:25:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:25:13,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:25:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:25:15,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:25:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:25:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:25:17,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:25:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:25:18,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:25:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:25:19,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:25:20,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:25:21,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:25:21,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:25:22,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:25:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:25:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:25:24,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:25:25,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:25:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:25:26,502][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:25:27,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:25:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:25:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:25:29,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:25:29,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:25:30,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:31,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:33,752][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:25:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:35,733][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:37,708][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:25:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:25:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:25:40,343][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:25:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:25:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:25:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:25:42,981][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:25:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:25:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:25:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:25:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:25:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:25:47,269][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:25:47,928][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:25:48,588][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:25:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:49,906][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:51,882][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:52,540][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:55,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:25:56,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:25:57,794][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:57,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:57,802][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:59,108][__main__][INFO] - Iteration 206 took 52s (10.02% Gen, 87.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 25m 25s. Estimated total time: 14h 30m 0s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 0s. [2026-03-25 17:25:59,111][__main__][INFO] - Starting iteration 206. [2026-03-25 17:25:59,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:25:59,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:04,044][__main__][INFO] - Number of regex retries in iteration 206: 0 [2026-03-25 17:26:04,046][__main__][INFO] - agents played in iteration 206 are Alice, Bob [2026-03-25 17:26:04,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:04,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:04,665][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:26:04,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:26:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:26:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:26:09,903][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:26:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:26:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:26:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:26:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:26:13,196][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:26:13,857][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:26:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:26:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:26:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:26:16,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:26:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:26:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:26:18,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:26:19,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:26:19,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:26:20,456][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:26:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:26:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:26:22,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:26:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:26:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:26:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:26:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:26:25,735][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:26:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:26:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:26:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:26:28,371][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:26:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:26:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:26:30,348][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:31,666][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:32,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:32,984][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:33,643][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:34,303][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:34,963][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:26:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:26:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:26:37,937][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:26:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:26:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:26:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:26:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:26:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:26:41,892][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:26:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:26:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:26:43,870][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:26:44,529][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:26:45,188][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:26:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:26:46,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:26:47,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:26:47,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:26:48,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:26:49,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:26:49,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:26:49,814][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:26:51,239][__main__][INFO] - Iteration 207 took 52s (9.46% Gen, 87.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 23m 19s. Estimated total time: 14h 28m 46s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 23s. [2026-03-25 17:26:51,242][__main__][INFO] - Starting iteration 207. [2026-03-25 17:26:51,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:26:51,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:56,460][__main__][INFO] - Number of regex retries in iteration 207: 0 [2026-03-25 17:26:56,461][__main__][INFO] - agents played in iteration 207 are Alice, Bob [2026-03-25 17:26:56,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:57,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:57,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:26:57,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:26:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:58,428][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:59,090][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:27:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:06,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:07,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:10,289][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:27:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:27:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:27:13,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:27:14,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:27:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:27:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:27:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:27:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:27:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:27:18,198][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:27:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:27:19,516][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:27:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:27:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:27:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:27:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:27:22,819][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:27:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:27:24,138][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:27:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:27:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:27:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:27:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:27:27,432][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:27:28,090][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:28,748][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:31,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:27:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:27:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:27:35,022][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:27:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:27:36,340][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:27:36,999][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:27:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:27:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:27:38,976][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:27:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:27:40,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:27:41,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:27:42,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:27:42,287][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:27:42,288][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:27:43,668][__main__][INFO] - Iteration 208 took 52s (9.94% Gen, 87.42% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 27m 23s. Estimated total time: 14h 33m 43s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 51s. [2026-03-25 17:27:43,671][__main__][INFO] - Starting iteration 208. [2026-03-25 17:27:43,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:27:43,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:27:48,962][__main__][INFO] - Number of regex retries in iteration 208: 0 [2026-03-25 17:27:48,963][__main__][INFO] - agents played in iteration 208 are Alice, Bob [2026-03-25 17:27:49,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:27:49,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:27:49,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:27:49,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:27:50,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:27:50,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:27:51,580][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:27:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:53,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:54,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:56,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:56,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:27:57,507][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:59,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:02,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:04,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:05,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:06,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:10,698][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:28:11,357][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:12,016][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:12,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:13,333][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:13,992][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:14,650][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:15,308][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:28:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:28:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:28:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:28:18,600][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:28:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:28:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:28:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:28:21,237][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:28:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:28:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:28:23,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:28:24,212][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:28:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:28:25,530][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:28:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:26,850][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:27,509][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:28,167][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:28,825][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:30,802][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:32,119][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:32,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:28:33,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:28:35,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:28:35,226][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:28:35,227][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:28:36,614][__main__][INFO] - Iteration 209 took 52s (9.99% Gen, 87.39% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 35m 8s. Estimated total time: 14h 42m 21s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 14s, 500 more iterations: 7h 21m 10s. [2026-03-25 17:28:36,616][__main__][INFO] - Starting iteration 209. [2026-03-25 17:28:36,621][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:28:36,621][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:28:42,683][__main__][INFO] - Number of regex retries in iteration 209: 0 [2026-03-25 17:28:42,684][__main__][INFO] - agents played in iteration 209 are Alice, Bob [2026-03-25 17:28:43,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:28:43,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:28:43,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:28:43,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:28:44,040][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:28:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:28:45,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:28:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:28:46,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:28:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:28:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:28:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:28:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:28:49,913][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:28:50,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:28:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:51,889][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:55,845][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:59,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:00,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:01,117][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:01,775][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:02,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:04,410][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:29:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:05,728][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:06,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:07,045][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:29:08,363][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:29:09,023][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:29:09,682][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:10,340][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:12,316][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:12,974][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:14,290][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:15,962][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:17,936][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:29:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:19,255][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:19,913][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:29:21,888][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:29:22,546][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:29:23,207][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:29:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:29:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:29:25,182][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:29:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:29:26,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:29:27,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:29:28,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:28,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:28,628][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:30,021][__main__][INFO] - Iteration 210 took 53s (11.35% Gen, 86.03% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 41m 56s. Estimated total time: 14h 50m 2s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 0s, 500 more iterations: 7h 25m 1s. [2026-03-25 17:29:30,024][__main__][INFO] - Starting iteration 210. [2026-03-25 17:29:30,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:29:30,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:29:34,744][__main__][INFO] - Number of regex retries in iteration 210: 0 [2026-03-25 17:29:34,746][__main__][INFO] - agents played in iteration 210 are Alice, Bob [2026-03-25 17:29:35,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:29:35,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:29:35,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:29:35,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:29:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:29:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:29:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:29:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:29:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:29:39,292][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:29:39,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:29:40,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:29:41,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:29:41,928][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:29:42,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:29:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:29:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:29:44,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:29:45,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:29:45,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:29:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:29:47,201][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:29:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:48,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:49,842][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:50,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:29:51,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:53,796][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:55,776][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:56,437][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:29:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:00,389][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:01,707][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:02,365][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:03,683][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:04,344][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:30:07,979][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:30:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:30:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:30:09,957][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:30:10,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:30:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:30:11,939][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:30:12,598][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:30:13,257][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:14,576][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:16,556][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:18,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:30:19,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:30:20,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:20,622][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:20,623][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:21,925][__main__][INFO] - Iteration 211 took 51s (9.09% Gen, 88.40% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 16m 1s. Estimated total time: 14h 24m 59s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 29s. [2026-03-25 17:30:21,927][__main__][INFO] - Starting iteration 211. [2026-03-25 17:30:21,931][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:30:21,932][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:27,234][__main__][INFO] - Number of regex retries in iteration 211: 0 [2026-03-25 17:30:27,235][__main__][INFO] - agents played in iteration 211 are Alice, Bob [2026-03-25 17:30:27,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:27,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:27,769][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:30:27,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:30:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:29,109][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:29,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:31,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:32,406][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:30:33,065][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:30:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:30:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:30:35,042][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:30:35,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:30:36,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:30:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:30:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:30:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:30:38,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:30:39,657][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:30:40,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:30:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:30:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:30:42,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:30:42,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:30:43,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:44,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:30:47,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:30:48,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:30:48,889][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:30:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:30:50,211][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:30:50,874][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:30:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:30:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:54,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:55,495][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:56,157][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:56,818][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:58,137][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:58,797][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:59,457][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:00,463][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:31:03,104][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:31:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:31:09,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:31:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:31:10,350][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:31:11,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:31:11,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:31:13,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:31:13,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:31:13,026][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:31:14,308][__main__][INFO] - Iteration 212 took 52s (10.12% Gen, 87.42% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 23m 8s. Estimated total time: 14h 32m 58s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 29s. [2026-03-25 17:31:14,310][__main__][INFO] - Starting iteration 212. [2026-03-25 17:31:14,314][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:31:14,315][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:19,243][__main__][INFO] - Number of regex retries in iteration 212: 0 [2026-03-25 17:31:19,245][__main__][INFO] - agents played in iteration 212 are Alice, Bob [2026-03-25 17:31:19,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:31:19,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:31:19,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:31:19,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:31:20,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:21,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:21,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:23,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:31:27,851][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:29,169][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:29,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:31,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:31:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:31:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:31:33,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:31:33,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:31:34,450][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:31:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:31:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:31:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:31:37,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:31:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:31:38,408][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:31:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:39,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:40,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:41,046][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:31:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:31:43,689][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:31:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:31:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:31:45,668][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:31:46,328][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:31:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:31:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:31:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:31:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:31:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:31:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:31:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:31:51,605][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:53,928][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:31:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:55,908][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:57,226][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:59,864][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:01,187][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:01,851][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:03,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:32:04,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:32:05,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:05,275][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:05,276][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:06,671][__main__][INFO] - Iteration 213 took 52s (9.42% Gen, 87.91% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 21m 55s. Estimated total time: 14h 32m 38s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 19s. [2026-03-25 17:32:06,674][__main__][INFO] - Starting iteration 213. [2026-03-25 17:32:06,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:32:06,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:32:12,801][__main__][INFO] - Number of regex retries in iteration 213: 0 [2026-03-25 17:32:12,802][__main__][INFO] - agents played in iteration 213 are Alice, Bob [2026-03-25 17:32:13,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:32:13,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:32:13,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:32:13,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:32:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:32:14,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:32:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:32:15,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:32:16,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:32:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:32:17,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:32:18,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:32:21,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:21,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:23,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:32:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:26,513][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:32:27,172][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:32:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:32:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:32:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:32:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:32:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:32:31,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:32:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:32:32,455][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:32:33,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:32:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:32:34,434][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:32:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:32:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:32:36,412][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:37,731][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:39,709][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:32:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:32:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:32:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:32:43,005][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:32:43,663][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:32:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:32:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:32:45,988][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:32:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:32:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:32:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:32:48,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:32:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:32:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:32:50,606][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:32:51,265][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:32:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:32:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:32:53,249][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:56,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:32:57,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:32:58,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:58,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:58,540][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:59,849][__main__][INFO] - Iteration 214 took 53s (11.51% Gen, 86.02% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 34m 37s. Estimated total time: 14h 46m 13s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 37s, 500 more iterations: 7h 23m 6s. [2026-03-25 17:32:59,851][__main__][INFO] - Starting iteration 214. [2026-03-25 17:32:59,856][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:32:59,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:05,208][__main__][INFO] - Number of regex retries in iteration 214: 0 [2026-03-25 17:33:05,210][__main__][INFO] - agents played in iteration 214 are Alice, Bob [2026-03-25 17:33:05,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:05,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:05,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:33:05,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:33:06,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:33:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:33:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:33:08,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:33:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:33:09,804][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:33:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:33:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:33:11,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:33:12,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:33:13,102][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:33:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:33:14,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:33:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:33:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:33:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:21,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:22,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:23,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:24,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:26,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:33:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:33:28,286][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:33:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:33:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:33:30,266][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:33:30,924][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:33:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:33:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:33:32,901][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:33,560][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:36,197][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:33:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:33:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:33:39,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:33:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:33:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:33:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:33:41,819][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:33:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:33:43,139][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:33:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:33:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:33:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:33:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:33:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:33:47,091][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:33:47,751][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:33:48,411][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:33:49,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:33:49,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:33:51,882][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:33:51,885][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:33:51,886][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:33:53,342][__main__][INFO] - Iteration 215 took 53s (10.01% Gen, 87.26% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 38m 59s. Estimated total time: 14h 51m 28s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 8s, 500 more iterations: 7h 25m 44s. [2026-03-25 17:33:53,345][__main__][INFO] - Starting iteration 215. [2026-03-25 17:33:53,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:33:53,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:58,267][__main__][INFO] - Number of regex retries in iteration 215: 0 [2026-03-25 17:33:58,269][__main__][INFO] - agents played in iteration 215 are Alice, Bob [2026-03-25 17:33:58,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:58,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:58,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:33:58,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:33:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:04,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:34:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:07,474][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:34:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:34:08,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:34:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:34:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:34:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:34:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:34:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:34:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:34:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:34:14,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:34:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:34:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:34:16,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:34:16,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:34:17,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:34:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:34:18,710][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:20,028][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:34:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:22,665][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:24,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:25,960][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:27,279][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:29,259][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:31,592][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:32,254][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:32,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:33,573][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:34:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:34:34,895][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:34:35,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:34:36,214][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:34:36,872][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:34:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:34:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:34:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:34:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:34:40,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:34:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:34:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:34:42,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:34:42,950][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:34:44,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:34:44,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:34:44,130][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:34:45,436][__main__][INFO] - Iteration 216 took 52s (9.44% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 14m 48s. Estimated total time: 14h 28m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 4s. [2026-03-25 17:34:45,439][__main__][INFO] - Starting iteration 216. [2026-03-25 17:34:45,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:34:45,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:34:50,238][__main__][INFO] - Number of regex retries in iteration 216: 0 [2026-03-25 17:34:50,239][__main__][INFO] - agents played in iteration 216 are Alice, Bob [2026-03-25 17:34:50,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:34:50,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:34:50,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:34:50,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:34:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:52,951][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:55,588][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:56,250][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:56,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:57,571][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:34:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:02,848][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:05,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:06,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:35:07,463][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:35:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:35:08,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:35:09,443][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:35:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:35:10,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:35:11,422][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:35:12,081][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:35:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:35:13,397][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:35:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:35:14,715][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:35:15,374][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:35:16,033][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:35:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:35:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:35:18,008][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:18,666][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:19,324][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:20,642][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:21,301][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:21,961][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:23,617][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:35:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:27,575][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:29,550][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:30,209][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:31,525][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:32,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:34,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:35:34,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:35:36,500][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:35:36,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:35:36,505][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:35:38,104][__main__][INFO] - Iteration 217 took 52s (9.11% Gen, 87.85% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 23m 29s. Estimated total time: 14h 37m 43s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 51s. [2026-03-25 17:35:38,106][__main__][INFO] - Starting iteration 217. [2026-03-25 17:35:38,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:35:38,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:35:43,250][__main__][INFO] - Number of regex retries in iteration 217: 0 [2026-03-25 17:35:43,251][__main__][INFO] - agents played in iteration 217 are Alice, Bob [2026-03-25 17:35:43,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:35:43,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:35:43,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:35:43,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:35:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:35:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:35:45,818][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:35:46,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:35:47,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:35:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:35:48,461][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:35:49,120][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:35:49,782][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:35:50,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:35:51,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:35:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:35:52,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:55,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:56,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:58,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:01,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:01,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:02,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:04,960][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:36:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:06,936][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:36:07,595][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:36:08,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:36:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:36:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:36:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:36:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:36:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:36:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:36:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:36:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:36:14,183][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:36:14,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:36:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:36:16,504][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:36:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:36:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:36:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:36:19,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:19,798][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:20,456][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:21,114][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:21,773][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:22,431][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:23,090][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:25,733][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:36:27,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:36:27,947][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:36:29,078][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:29,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:29,083][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:30,845][__main__][INFO] - Iteration 218 took 52s (9.75% Gen, 86.90% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 23m 50s. Estimated total time: 14h 38m 57s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 53s, 500 more iterations: 7h 19m 28s. [2026-03-25 17:36:30,848][__main__][INFO] - Starting iteration 218. [2026-03-25 17:36:30,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:36:30,853][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:36,277][__main__][INFO] - Number of regex retries in iteration 218: 0 [2026-03-25 17:36:36,279][__main__][INFO] - agents played in iteration 218 are Alice, Bob [2026-03-25 17:36:36,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:36:36,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:36:36,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:36:36,828][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:36:37,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:36:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:36:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:36:39,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:36:40,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:36:40,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:36:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:36:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:36:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:36:43,394][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:36:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:36:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:36:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:36:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:36:46,691][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:36:47,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:36:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:36:48,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:36:49,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:36:49,990][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:36:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:36:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:36:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:36:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:53,293][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:55,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:55,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:57,910][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:36:58,569][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:59,228][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:59,887][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:37:00,546][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:37:01,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:37:01,864][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:37:02,523][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:37:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:37:03,840][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:37:04,500][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:37:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:37:05,819][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:37:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:37:07,140][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:37:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:37:08,459][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:37:09,458][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:37:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:37:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:37:11,438][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:37:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:37:12,758][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:37:13,416][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:37:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:37:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:37:15,395][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:37:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:37:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:37:17,370][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:37:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:37:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:37:19,346][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:20,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:37:20,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:37:22,078][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:22,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:22,082][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:37:23,448][__main__][INFO] - Iteration 219 took 52s (10.31% Gen, 87.08% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 20m 38s. Estimated total time: 14h 36m 37s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 18s. [2026-03-25 17:37:23,451][__main__][INFO] - Starting iteration 219. [2026-03-25 17:37:23,455][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:37:23,456][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:37:43,573][__main__][INFO] - Number of regex retries in iteration 219: 0 [2026-03-25 17:37:43,575][__main__][INFO] - agents played in iteration 219 are Alice, Bob [2026-03-25 17:37:44,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:37:44,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:37:44,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:37:44,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:37:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:45,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:46,160][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:46,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:48,139][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:49,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:50,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:37:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:52,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:55,390][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:37:56,050][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:37:56,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:37:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:37:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:37:58,690][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:37:59,349][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:01,327][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:02,647][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:03,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:03,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:04,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:05,282][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:38:05,940][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:38:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:38:09,233][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:38:09,892][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:38:10,551][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:38:11,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:38:11,871][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:38:12,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:38:13,189][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:38:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:38:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:38:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:38:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:38:16,821][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:38:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:38:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:38:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:38:19,457][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:38:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:38:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:38:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:38:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:38:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:38:23,415][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:38:24,074][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:38:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:38:25,393][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:38:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:38:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:38:27,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:38:28,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:38:29,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:38:29,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:38:29,305][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:38:30,662][__main__][INFO] - Iteration 220 took 1m 7s (29.93% Gen, 68.04% Train). Generation: 20s, Training: 45s. Estimated remaining time: 15h 23m 3s. Estimated total time: 18h 40m 9s. Time estimates for 10 more iterations: 11m 12s, 100 more iterations: 1h 52m 0s, 500 more iterations: 9h 20m 4s. [2026-03-25 17:38:30,665][__main__][INFO] - Starting iteration 220. [2026-03-25 17:38:30,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:38:30,670][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:38:35,459][__main__][INFO] - Number of regex retries in iteration 220: 0 [2026-03-25 17:38:35,460][__main__][INFO] - agents played in iteration 220 are Alice, Bob [2026-03-25 17:38:36,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:38:36,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:38:36,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:38:36,107][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:38:36,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:38:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:38:38,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:38:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:38:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:38:40,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:38:40,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:38:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:38:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:38:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:38:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:38:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:38:44,695][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:38:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:38:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:38:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:38:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:49,314][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:49,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:50,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:53,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:56,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:57,228][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:38:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:59,204][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:03,817][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:04,477][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:05,137][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:05,797][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:07,115][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:07,773][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:10,752][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:39:11,410][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:39:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:39:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:39:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:39:14,048][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:39:14,709][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:39:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:39:16,029][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:39:16,689][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:39:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:39:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:39:18,668][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:39:19,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:39:20,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:39:21,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:39:21,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:39:21,410][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:39:23,507][__main__][INFO] - Iteration 221 took 52s (9.07% Gen, 86.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 22m 40s. Estimated total time: 14h 40m 39s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 3s, 500 more iterations: 7h 20m 19s. [2026-03-25 17:39:23,509][__main__][INFO] - Starting iteration 221. [2026-03-25 17:39:23,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:39:23,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:39:29,427][__main__][INFO] - Number of regex retries in iteration 221: 0 [2026-03-25 17:39:29,427][__main__][INFO] - agents played in iteration 221 are Alice, Bob [2026-03-25 17:39:29,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:39:29,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:39:29,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:39:29,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:39:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:39:31,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:39:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:39:32,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:39:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:39:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:39:34,592][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:39:35,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:39:35,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:39:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:39:37,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:39:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:39:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:39:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:39:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:39:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:39:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:39:41,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:39:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:39:43,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:39:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:39:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:39:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:39:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:39:46,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:39:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:39:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:39:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:39:49,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:39:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:39:50,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:39:51,067][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:39:51,726][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:39:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:53,704][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:54,365][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:55,024][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:56,348][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:57,014][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:58,994][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:03,333][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:04,652][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:40:05,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:06,632][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:07,290][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:07,949][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:09,268][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:09,929][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:10,587][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:11,246][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:11,905][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:13,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:40:14,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:40:15,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:40:15,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:40:15,285][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:40:16,581][__main__][INFO] - Iteration 222 took 53s (11.14% Gen, 86.41% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 25m 36s. Estimated total time: 14h 44m 29s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 14s. [2026-03-25 17:40:16,583][__main__][INFO] - Starting iteration 222. [2026-03-25 17:40:16,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:40:16,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:40:21,480][__main__][INFO] - Number of regex retries in iteration 222: 0 [2026-03-25 17:40:21,481][__main__][INFO] - agents played in iteration 222 are Alice, Bob [2026-03-25 17:40:21,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:40:22,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:40:22,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:40:22,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:40:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:40:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:40:24,106][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:40:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:40:25,423][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:40:26,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:40:26,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:40:27,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:40:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:40:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:40:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:40:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:40:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:40:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:40:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:40:32,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:40:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:40:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:40:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:40:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:40:35,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:40:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:40:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:40:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:40:38,609][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:40:39,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:40:39,927][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:40:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:40:41,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:40:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:40:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:40:43,224][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:40:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:40:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:40:45,200][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:40:45,860][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:40:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:40:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:40:47,841][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:40:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:40:49,159][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:40:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:40:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:40:51,134][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:40:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:52,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:55,439][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:40:57,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:58,078][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:58,740][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:59,400][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:01,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:04,020][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:05,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:41:06,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:41:07,279][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:07,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:07,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:08,685][__main__][INFO] - Iteration 223 took 52s (9.39% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 8m 35s. Estimated total time: 14h 28m 20s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 10s. [2026-03-25 17:41:08,687][__main__][INFO] - Starting iteration 223. [2026-03-25 17:41:08,692][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:41:08,693][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:13,706][__main__][INFO] - Number of regex retries in iteration 223: 0 [2026-03-25 17:41:13,708][__main__][INFO] - agents played in iteration 223 are Alice, Bob [2026-03-25 17:41:14,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:41:14,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:41:14,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:41:14,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:41:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:41:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:41:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:41:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:41:17,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:41:18,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:41:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:41:19,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:41:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:41:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:41:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:41:22,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:41:22,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:41:23,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:41:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:41:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:41:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:41:26,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:41:26,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:41:27,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:41:28,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:41:28,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:41:29,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:41:30,189][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:41:30,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:41:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:41:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:41:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:41:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:41:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:41:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:41:35,469][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:41:36,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:41:36,791][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:41:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:41:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:41:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:41:39,436][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:41:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:41:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:41:41,418][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:41:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:41:42,739][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:41:43,403][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:41:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:41:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:41:45,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:41:46,050][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:41:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:41:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:41:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:41:49,044][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:41:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:50,365][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:51,684][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:53,661][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:54,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:54,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:57,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:41:58,483][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:41:59,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:59,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:59,643][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:01,059][__main__][INFO] - Iteration 224 took 52s (9.58% Gen, 87.71% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 12m 12s. Estimated total time: 14h 32m 49s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 24s. [2026-03-25 17:42:01,061][__main__][INFO] - Starting iteration 224. [2026-03-25 17:42:01,066][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:42:01,067][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:06,049][__main__][INFO] - Number of regex retries in iteration 224: 0 [2026-03-25 17:42:06,050][__main__][INFO] - agents played in iteration 224 are Alice, Bob [2026-03-25 17:42:06,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:42:06,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:42:06,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:42:06,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:42:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:10,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:11,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:42:14,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:15,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:17,954][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:19,270][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:42:19,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:42:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:42:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:42:21,906][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:42:22,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:42:23,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:42:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:42:24,540][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:42:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:42:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:42:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:42:27,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:42:27,834][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:42:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:42:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:42:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:42:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:42:31,128][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:42:31,786][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:42:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:42:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:42:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:42:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:42:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:42:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:42:36,395][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:42:37,055][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:42:37,713][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:42:38,372][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:42:39,370][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:42:40,030][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:42:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:42:41,348][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:42:42,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:42:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:42:43,325][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:42:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:42:44,643][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:42:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:49,256][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:42:49,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:42:50,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:42:51,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:42:51,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:42:51,986][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:53,514][__main__][INFO] - Iteration 225 took 52s (9.50% Gen, 87.58% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 12m 41s. Estimated total time: 14h 34m 10s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 5s. [2026-03-25 17:42:53,517][__main__][INFO] - Starting iteration 225. [2026-03-25 17:42:53,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:42:53,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:59,434][__main__][INFO] - Number of regex retries in iteration 225: 0 [2026-03-25 17:42:59,435][__main__][INFO] - agents played in iteration 225 are Alice, Bob [2026-03-25 17:43:00,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:00,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:00,074][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:43:00,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:43:00,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:01,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:03,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:04,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:05,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:06,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:43:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:43:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:43:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:43:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:43:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:43:11,352][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:43:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:43:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:13,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:13,988][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:14,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:15,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:15,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:18,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:21,236][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:43:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:22,553][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:23,211][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:43:25,187][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:43:25,846][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:43:26,505][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:43:27,164][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:43:27,823][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:43:28,482][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:43:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:43:29,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:43:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:43:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:43:31,776][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:43:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:43:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:43:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:43:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:43:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:43:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:43:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:43:37,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:43:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:43:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:43:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:43:40,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:43:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:43:41,341][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:43:42,000][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:43:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:43,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:43:44,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:43:45,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:45,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:45,307][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:43:46,622][__main__][INFO] - Iteration 226 took 53s (11.13% Gen, 86.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 22m 41s. Estimated total time: 14h 45m 3s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 30s, 500 more iterations: 7h 22m 31s. [2026-03-25 17:43:46,625][__main__][INFO] - Starting iteration 226. [2026-03-25 17:43:46,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:43:46,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:51,480][__main__][INFO] - Number of regex retries in iteration 226: 0 [2026-03-25 17:43:51,481][__main__][INFO] - agents played in iteration 226 are Alice, Bob [2026-03-25 17:43:51,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:52,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:52,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:43:52,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:43:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:53,434][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:56,732][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:57,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:58,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:59,366][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:44:00,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:02,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:02,659][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:44:07,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:44:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:44:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:44:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:44:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:44:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:44:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:44:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:44:13,199][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:44:13,857][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:44:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:44:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:44:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:44:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:21,101][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:21,761][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:22,419][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:23,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:24,726][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:26,045][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:44:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:44:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:44:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:44:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:44:29,999][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:44:30,657][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:44:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:44:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:44:32,632][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:44:33,290][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:44:33,948][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:44:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:44:35,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:44:36,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:44:37,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:44:37,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:44:37,254][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:38,568][__main__][INFO] - Iteration 227 took 51s (9.34% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 2m 25s. Estimated total time: 14h 25m 40s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 50s. [2026-03-25 17:44:38,570][__main__][INFO] - Starting iteration 227. [2026-03-25 17:44:38,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:44:38,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:44:44,037][__main__][INFO] - Number of regex retries in iteration 227: 0 [2026-03-25 17:44:44,039][__main__][INFO] - agents played in iteration 227 are Alice, Bob [2026-03-25 17:44:44,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:44:44,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:44:44,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:44:44,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:44:45,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:46,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:47,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:48,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:44:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:44:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:44:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:44:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:44:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:53,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:54,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:55,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:56,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:57,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:59,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:01,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:01,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:02,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:03,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:05,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:05,847][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:45:06,505][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:45:07,825][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:45:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:45:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:45:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:45:10,460][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:45:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:45:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:45:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:45:13,098][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:45:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:45:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:45:15,075][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:45:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:45:16,395][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:45:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:45:18,069][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:45:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:45:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:45:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:20,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:22,030][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:22,689][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:24,009][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:24,668][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:27,312][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:27,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:45:28,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:45:29,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:29,954][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:29,956][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:45:31,383][__main__][INFO] - Iteration 228 took 52s (10.35% Gen, 86.95% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 16m 3s. Estimated total time: 14h 40m 11s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 5s. [2026-03-25 17:45:31,385][__main__][INFO] - Starting iteration 228. [2026-03-25 17:45:31,389][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:45:31,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:45:36,509][__main__][INFO] - Number of regex retries in iteration 228: 0 [2026-03-25 17:45:36,511][__main__][INFO] - agents played in iteration 228 are Alice, Bob [2026-03-25 17:45:36,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:45:37,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:45:37,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:45:37,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:45:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:45:38,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:45:39,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:45:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:45:40,348][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:45:41,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:45:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:42,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:43,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:45:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:46,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:45:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:45:47,593][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:45:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:45:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:45:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:45:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:45:50,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:45:51,563][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:45:52,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:54,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:56,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:58,159][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:45:58,819][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:59,479][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:01,460][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:02,119][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:02,778][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:46:07,391][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:46:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:46:08,708][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:46:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:46:10,398][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:46:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:46:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:46:12,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:46:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:46:13,705][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:46:14,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:46:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:46:15,690][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:46:16,350][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:46:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:46:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:46:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:46:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:46:19,653][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:46:20,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:46:21,273][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:46:22,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:46:22,439][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:46:22,440][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:23,807][__main__][INFO] - Iteration 229 took 52s (9.77% Gen, 87.62% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 8m 40s. Estimated total time: 14h 33m 40s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 50s. [2026-03-25 17:46:23,814][__main__][INFO] - Starting iteration 229. [2026-03-25 17:46:23,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:46:23,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:29,203][__main__][INFO] - Number of regex retries in iteration 229: 0 [2026-03-25 17:46:29,204][__main__][INFO] - agents played in iteration 229 are Alice, Bob [2026-03-25 17:46:29,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:46:29,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:46:29,852][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:46:29,852][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:46:30,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:31,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:32,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:46:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:46:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:46:34,615][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:46:35,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:46:35,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:46:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:46:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:46:37,916][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:46:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:46:39,245][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:41,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:43,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:44,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:46:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:46:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:46:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:46:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:46:48,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:46:49,154][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:46:49,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:46:50,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:46:51,133][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:46:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:46:52,453][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:54,430][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:55,090][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:55,750][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:57,745][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:58,405][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:00,387][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:01,047][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:01,707][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:04,151][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:04,810][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:47:05,467][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:06,788][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:47:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:47:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:47:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:47:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:47:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:47:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:47:12,730][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:47:13,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:47:14,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:47:15,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:47:15,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:47:15,398][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:47:16,754][__main__][INFO] - Iteration 230 took 52s (10.16% Gen, 87.27% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 16m 18s. Estimated total time: 14h 42m 11s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 13s, 500 more iterations: 7h 21m 5s. [2026-03-25 17:47:16,757][__main__][INFO] - Starting iteration 230. [2026-03-25 17:47:16,762][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:47:16,763][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:21,501][__main__][INFO] - Number of regex retries in iteration 230: 0 [2026-03-25 17:47:21,502][__main__][INFO] - agents played in iteration 230 are Alice, Bob [2026-03-25 17:47:22,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:47:22,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:47:22,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:47:22,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:47:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:23,448][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:47:24,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:47:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:47:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:30,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:34,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:35,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:47:36,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:47:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:39,245][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:39,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:40,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:43,197][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:47:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:47:44,514][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:47:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:47:45,830][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:47:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:47:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:47:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:47:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:47:49,120][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:47:49,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:47:50,436][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:47:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:47:51,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:54,710][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:47:57,343][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:01,298][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:02,615][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:03,273][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:05,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:48:06,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:48:07,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:07,161][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:07,162][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:48:08,771][__main__][INFO] - Iteration 231 took 52s (9.11% Gen, 87.79% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 0m 6s. Estimated total time: 14h 26m 50s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 25s. [2026-03-25 17:48:08,773][__main__][INFO] - Starting iteration 231. [2026-03-25 17:48:08,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:48:08,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:48:13,659][__main__][INFO] - Number of regex retries in iteration 231: 0 [2026-03-25 17:48:13,660][__main__][INFO] - agents played in iteration 231 are Alice, Bob [2026-03-25 17:48:14,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:48:14,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:48:14,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:48:14,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:48:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:48:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:48:16,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:48:16,983][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:48:17,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:48:18,301][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:48:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:48:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:48:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:48:20,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:48:21,591][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:48:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:48:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:48:23,564][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:48:24,223][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:48:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:48:25,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:48:26,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:48:26,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:48:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:48:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:29,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:33,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:34,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:35,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:48:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:36,738][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:38,055][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:38,713][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:39,371][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:48:40,688][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:48:41,346][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:48:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:48:42,662][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:48:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:48:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:48:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:48:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:48:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:48:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:48:47,606][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:48:48,265][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:48:48,923][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:48:49,582][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:48:50,240][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:48:50,899][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:48:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:48:52,216][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:53,535][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:54,194][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:54,853][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:56,173][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:57,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:48:58,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:48:59,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:59,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:59,406][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:00,804][__main__][INFO] - Iteration 232 took 52s (9.38% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 59m 32s. Estimated total time: 14h 27m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 34s. [2026-03-25 17:49:00,807][__main__][INFO] - Starting iteration 232. [2026-03-25 17:49:00,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:49:00,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:05,708][__main__][INFO] - Number of regex retries in iteration 232: 0 [2026-03-25 17:49:05,709][__main__][INFO] - agents played in iteration 232 are Alice, Bob [2026-03-25 17:49:06,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:49:06,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:49:06,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:49:06,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:49:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:07,596][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:08,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:49:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:49:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:49:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:49:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:49:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:49:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:49:13,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:49:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:49:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:49:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:49:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:49:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:49:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:49:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:49:18,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:49:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:49:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:49:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:49:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:49:22,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:49:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:49:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:49:24,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:49:24,720][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:49:25,378][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:49:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:49:26,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:49:27,353][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:49:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:49:28,669][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:49:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:49:29,986][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:49:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:49:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:31,961][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:32,620][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:33,936][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:34,594][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:40,203][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:49:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:49:43,499][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:49:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:49:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:49:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:49:46,136][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:49:46,795][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:49:47,454][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:49:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:49:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:49:49,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:49:50,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:49:51,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:49:51,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:49:51,423][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:52,890][__main__][INFO] - Iteration 233 took 52s (9.40% Gen, 87.78% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 59m 32s. Estimated total time: 14h 28m 1s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 0s. [2026-03-25 17:49:52,892][__main__][INFO] - Starting iteration 233. [2026-03-25 17:49:52,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:49:52,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:59,879][__main__][INFO] - Number of regex retries in iteration 233: 0 [2026-03-25 17:49:59,880][__main__][INFO] - agents played in iteration 233 are Alice, Bob [2026-03-25 17:50:00,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:00,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:00,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:50:00,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:50:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:02,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:03,099][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:05,075][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:07,051][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:50:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:50:08,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:50:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:50:09,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:50:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:50:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:50:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:50:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:50:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:50:13,642][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:50:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:50:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:50:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:50:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:50:16,936][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:50:17,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:50:18,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:50:18,913][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:50:19,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:50:20,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:50:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:50:21,548][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:50:22,207][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:50:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:50:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:50:24,187][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:50:24,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:50:25,504][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:50:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:50:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:50:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:50:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:50:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:50:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:32,097][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:50:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:50:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:50:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:50:37,030][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:41,642][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:42,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:43,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:50:44,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:50:45,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:45,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:45,669][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:47,230][__main__][INFO] - Iteration 234 took 54s (12.85% Gen, 84.27% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 36m 12s. Estimated total time: 15h 5m 35s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 33s, 500 more iterations: 7h 32m 47s. [2026-03-25 17:50:47,233][__main__][INFO] - Starting iteration 234. [2026-03-25 17:50:47,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:50:47,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:50:52,301][__main__][INFO] - Number of regex retries in iteration 234: 0 [2026-03-25 17:50:52,302][__main__][INFO] - agents played in iteration 234 are Alice, Bob [2026-03-25 17:50:52,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:52,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:52,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:50:52,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:50:53,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:54,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:54,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:56,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:57,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:58,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:59,465][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:00,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:51:00,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:02,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:03,416][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:05,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:06,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:51:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:51:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:51:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:51:10,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:51:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:51:11,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:51:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:51:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:51:13,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:51:13,959][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:51:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:51:15,277][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:51:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:51:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:51:17,255][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:51:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:51:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:51:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:51:19,894][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:51:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:51:21,212][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:51:21,871][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:51:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:51:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:51:23,847][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:51:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:51:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:51:26,123][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:51:26,782][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:27,443][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:51:28,101][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:28,759][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:32,055][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:51:33,372][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:51:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:51:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:51:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:51:36,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:51:36,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:51:37,916][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:51:37,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:51:37,925][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:51:39,153][__main__][INFO] - Iteration 235 took 51s (9.76% Gen, 87.88% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 55m 2s. Estimated total time: 14h 25m 17s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 38s. [2026-03-25 17:51:39,155][__main__][INFO] - Starting iteration 235. [2026-03-25 17:51:39,159][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:51:39,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:44,437][__main__][INFO] - Number of regex retries in iteration 235: 0 [2026-03-25 17:51:44,438][__main__][INFO] - agents played in iteration 235 are Alice, Bob [2026-03-25 17:51:45,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:51:45,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:51:45,107][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:51:45,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:51:45,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:51:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:51:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:51:49,704][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:51:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:51:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:52,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:51:52,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:53,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:54,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:56,942][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:58,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:59,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:52:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:52:00,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:52:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:52:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:52:02,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:52:03,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:52:04,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:52:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:52:05,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:52:06,151][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:52:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:52:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:52:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:52:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:52:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:52:10,101][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:52:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:52:11,417][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:52:12,076][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:52:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:52:13,392][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:52:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:52:14,716][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:52:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:52:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:52:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:52:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:52:18,312][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:52:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:52:19,629][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:52:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:52:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:52:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:52:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:52:22,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:52:23,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:52:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:52:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:52:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:52:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:52:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:52:27,536][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:52:28,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:52:29,010][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:52:30,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:52:30,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:52:30,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:52:31,566][__main__][INFO] - Iteration 236 took 52s (10.07% Gen, 87.32% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 2m 22s. Estimated total time: 14h 33m 29s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 44s. [2026-03-25 17:52:31,569][__main__][INFO] - Starting iteration 236. [2026-03-25 17:52:31,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:52:31,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:52:46,118][__main__][INFO] - Number of regex retries in iteration 236: 0 [2026-03-25 17:52:46,118][__main__][INFO] - agents played in iteration 236 are Alice, Bob [2026-03-25 17:52:46,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:52:46,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:52:46,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:52:46,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:52:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:52:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:52:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:52:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:50,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:52,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:53,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:52:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:56,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:57,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:59,158][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:53:00,472][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:01,787][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:04,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:05,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:07,708][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:53:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:09,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:10,341][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:53:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:53:12,972][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:53:13,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:53:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:53:14,945][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:53:15,603][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:53:16,260][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:53:16,918][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:53:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:53:18,234][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:53:19,122][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:53:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:53:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:53:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:53:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:53:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:53:23,070][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:53:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:53:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:53:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:53:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:53:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:53:27,021][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:53:27,679][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:53:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:53:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:53:29,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:53:30,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:53:31,570][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:53:31,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:53:31,575][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:53:32,821][__main__][INFO] - Iteration 237 took 1m 1s (23.75% Gen, 74.21% Train). Generation: 14s, Training: 45s. Estimated remaining time: 13h 28m 41s. Estimated total time: 17h 0m 50s. Time estimates for 10 more iterations: 10m 12s, 100 more iterations: 1h 42m 5s, 500 more iterations: 8h 30m 25s. [2026-03-25 17:53:32,824][__main__][INFO] - Starting iteration 237. [2026-03-25 17:53:32,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:53:32,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:53:47,908][__main__][INFO] - Number of regex retries in iteration 237: 0 [2026-03-25 17:53:47,910][__main__][INFO] - agents played in iteration 237 are Alice, Bob [2026-03-25 17:53:48,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:53:48,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:53:48,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:53:48,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:53:49,199][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:53:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:53:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:53:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:53:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:53:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:53:53,093][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:53:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:53:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:53:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:53:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:53:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:53:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:53:57,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:53:58,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:53:59,016][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:53:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:00,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:00,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:02,967][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:03,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:04,284][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:04,943][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:05,601][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:54:08,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:54:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:54:09,553][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:54:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:54:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:54:11,529][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:54:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:54:12,846][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:54:13,505][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:14,163][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:14,822][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:15,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:18,776][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:21,012][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:21,671][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:22,989][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:54:23,648][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:24,966][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:25,625][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:54:27,604][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:54:28,263][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:54:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:54:29,582][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:54:30,241][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:54:30,901][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:54:31,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:54:32,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:54:33,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:54:33,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:54:33,443][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:54:34,848][__main__][INFO] - Iteration 238 took 1m 2s (24.32% Gen, 73.42% Train). Generation: 15s, Training: 45s. Estimated remaining time: 13h 40m 31s. Estimated total time: 17h 13m 42s. Time estimates for 10 more iterations: 10m 20s, 100 more iterations: 1h 43m 22s, 500 more iterations: 8h 36m 51s. [2026-03-25 17:54:34,851][__main__][INFO] - Starting iteration 238. [2026-03-25 17:54:34,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:54:34,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:54:40,092][__main__][INFO] - Number of regex retries in iteration 238: 0 [2026-03-25 17:54:40,093][__main__][INFO] - agents played in iteration 238 are Alice, Bob [2026-03-25 17:54:40,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:54:40,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:54:40,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:54:40,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:54:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:54:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:54:42,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:54:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:54:44,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:54:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:54:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:54:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:54:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:54:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:54:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:54:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:54:49,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:54:50,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:54:50,750][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:54,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:54,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:59,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:59,963][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:00,621][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:01,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:01,936][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:55:02,594][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:04,567][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:06,541][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:55:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:55:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:55:09,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:55:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:55:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:55:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:55:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:55:13,379][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:55:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:55:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:55:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:55:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:55:16,668][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:17,984][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:19,963][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:21,940][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:23,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:23,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:55:24,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:55:25,899][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:25,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:25,909][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:55:27,140][__main__][INFO] - Iteration 239 took 52s (10.02% Gen, 87.62% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 57m 24s. Estimated total time: 14h 31m 27s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 43s. [2026-03-25 17:55:27,143][__main__][INFO] - Starting iteration 239. [2026-03-25 17:55:27,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:55:27,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:55:32,207][__main__][INFO] - Number of regex retries in iteration 239: 0 [2026-03-25 17:55:32,209][__main__][INFO] - agents played in iteration 239 are Alice, Bob [2026-03-25 17:55:32,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:55:32,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:55:32,952][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:55:32,952][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:55:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:55:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:55:34,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:55:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:55:36,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:55:36,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:55:37,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:55:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:55:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:55:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:55:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:55:40,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:55:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:55:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:55:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:55:43,423][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:55:44,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:55:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:55:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:55:46,053][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:55:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:55:47,369][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:55:48,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:55:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:55:49,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:55:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:55:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:55:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:55:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:53,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:53,949][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:55:54,607][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:56,583][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:59,874][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:00,533][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:01,849][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:03,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:03,824][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:56:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:56:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:56:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:56:07,411][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:56:08,070][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:56:08,728][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:56:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:56:10,045][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:56:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:56:11,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:56:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:56:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:56:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:56:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:15,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:56:16,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:56:18,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:18,023][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:18,024][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:19,280][__main__][INFO] - Iteration 240 took 52s (9.71% Gen, 87.88% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 53m 59s. Estimated total time: 14h 28m 54s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 27s. [2026-03-25 17:56:19,282][__main__][INFO] - Starting iteration 240. [2026-03-25 17:56:19,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:56:19,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:27,299][__main__][INFO] - Number of regex retries in iteration 240: 0 [2026-03-25 17:56:27,301][__main__][INFO] - agents played in iteration 240 are Alice, Bob [2026-03-25 17:56:27,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:56:28,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:56:28,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:56:28,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:56:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:29,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:56:30,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:56:31,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:56:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:56:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:56:33,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:56:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:56:34,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:56:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:56:35,865][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:56:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:56:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:56:37,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:56:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:56:39,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:56:39,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:56:40,469][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:56:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:56:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:42,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:44,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:56:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:56:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:56:47,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:56:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:56:49,020][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:56:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:56:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:56:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:56:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:56:52,308][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:56:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:56:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:56:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:56:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:58,893][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:59,551][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:57:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:04,399][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:05,715][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:07,690][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:57:08,348][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:57:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:57:09,664][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:57:10,323][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:57:10,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:57:11,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 17:57:12,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:57:12,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:57:12,804][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:14,354][__main__][INFO] - Iteration 241 took 55s (14.55% Gen, 82.63% Train). Generation: 8s, Training: 45s. Estimated remaining time: 11h 41m 59s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s. [2026-03-25 17:57:14,485][__main__][INFO] - Starting iteration 241. [2026-03-25 17:57:14,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:57:14,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:57:19,302][__main__][INFO] - Number of regex retries in iteration 241: 0 [2026-03-25 17:57:19,303][__main__][INFO] - agents played in iteration 241 are Alice, Bob [2026-03-25 17:57:19,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:57:20,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:57:20,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:57:20,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:57:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:57:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:57:22,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:22,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:23,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:25,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:27,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:57:27,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:29,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:33,877][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:57:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:57:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:57:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:57:36,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:57:37,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:57:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:57:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:57:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:57:40,472][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:57:41,131][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:57:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:57:42,451][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:57:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:57:43,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:57:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:57:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:57:45,746][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:57:46,406][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:57:47,064][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:57:47,725][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:57:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:57:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:57:49,704][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:57:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:57:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:57:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:52,592][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:53,911][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:57:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:58,524][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:00,500][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:01,819][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:02,478][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:03,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:58:03,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:58:05,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:05,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:05,127][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:06,517][__main__][INFO] - Iteration 242 took 52s (9.25% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 50m 27s. Estimated total time: 14h 27m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 34s. [2026-03-25 17:58:06,519][__main__][INFO] - Starting iteration 242. [2026-03-25 17:58:06,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:58:06,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:11,491][__main__][INFO] - Number of regex retries in iteration 242: 0 [2026-03-25 17:58:11,492][__main__][INFO] - agents played in iteration 242 are Alice, Bob [2026-03-25 17:58:12,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:58:12,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:58:12,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:58:12,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:58:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:58:13,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:58:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:58:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:58:15,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:58:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:58:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:58:17,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:58:18,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:58:18,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:58:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:58:20,078][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:58:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:58:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:58:22,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:58:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:58:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:58:24,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:58:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:58:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:58:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:27,327][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:27,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:29,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:31,940][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:32,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:33,258][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:58:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:35,234][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:58:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:58:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:58:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:58:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:58:40,507][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:58:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:58:41,826][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:58:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:58:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:58:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:58:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:58:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:58:46,042][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:58:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:58:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:58:48,018][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:58:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:58:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:58:49,993][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:58:50,651][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:58:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:58:51,968][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:53,284][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:53,943][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:55,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:58:56,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:58:57,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:57,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:57,212][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:59,053][__main__][INFO] - Iteration 243 took 52s (9.46% Gen, 87.03% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 57m 57s. Estimated total time: 14h 35m 32s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 33s, 500 more iterations: 7h 17m 46s. [2026-03-25 17:58:59,056][__main__][INFO] - Starting iteration 243. [2026-03-25 17:58:59,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:58:59,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:59:04,208][__main__][INFO] - Number of regex retries in iteration 243: 0 [2026-03-25 17:59:04,210][__main__][INFO] - agents played in iteration 243 are Alice, Bob [2026-03-25 17:59:04,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:04,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:04,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:59:04,925][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:59:05,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:06,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:59:07,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:59:08,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:59:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:59:09,530][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:59:10,187][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:59:10,846][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:59:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:59:12,163][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:59:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:59:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:59:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:59:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:59:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:59:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:59:16,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:59:17,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:59:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:59:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:59:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:59:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:59:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:59:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:59:22,040][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:59:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:59:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:59:24,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:59:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:59:25,333][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:59:25,992][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:59:26,653][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:27,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:28,630][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:59:29,947][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:30,605][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:33,897][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:35,214][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:36,530][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:37,523][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:38,182][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:39,500][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:59:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:59:41,477][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:59:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:59:42,794][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:59:43,454][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:59:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:59:44,774][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:59:45,434][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:59:46,092][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:59:46,751][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:59:47,410][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:59:48,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:59:48,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:59:50,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:59:50,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:59:50,613][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:59:52,644][__main__][INFO] - Iteration 244 took 53s (9.61% Gen, 86.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 14m 37s. Estimated total time: 14h 53m 5s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 18s, 500 more iterations: 7h 26m 32s. [2026-03-25 17:59:52,647][__main__][INFO] - Starting iteration 244. [2026-03-25 17:59:52,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:59:52,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:59:57,463][__main__][INFO] - Number of regex retries in iteration 244: 0 [2026-03-25 17:59:57,464][__main__][INFO] - agents played in iteration 244 are Alice, Bob [2026-03-25 17:59:57,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:58,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:58,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:59:58,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:59:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:59,328][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:03,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:00:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:06,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:00:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:00:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:00:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:00:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:00:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:00:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:00:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:00:12,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:00:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:00:13,811][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:00:14,470][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:00:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:00:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:00:16,447][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:00:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:00:17,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:00:18,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:00:19,086][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:00:19,744][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:00:20,403][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:00:21,063][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:00:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:00:22,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:23,698][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:26,336][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:00:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:00:27,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:00:28,314][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:00:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:00:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:00:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:00:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:00:31,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:00:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:00:33,222][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:00:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:36,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:41,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:00:41,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:00:43,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:43,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:43,082][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:44,440][__main__][INFO] - Iteration 245 took 51s (9.29% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 43m 51s. Estimated total time: 14h 23m 11s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 19s, 500 more iterations: 7h 11m 35s. [2026-03-25 18:00:44,443][__main__][INFO] - Starting iteration 245. [2026-03-25 18:00:44,449][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:00:44,449][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:49,876][__main__][INFO] - Number of regex retries in iteration 245: 0 [2026-03-25 18:00:49,878][__main__][INFO] - agents played in iteration 245 are Alice, Bob [2026-03-25 18:00:50,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:50,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:50,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:00:50,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:00:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:51,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:56,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:00:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:59,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:59,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:00,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:01,005][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:02,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:02,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:03,642][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:04,959][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:06,936][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:07,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:01:08,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:01:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:01:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:01:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:01:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:01:11,547][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:01:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:01:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:01:13,523][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:01:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:01:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:01:15,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:01:16,157][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:01:16,816][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:01:17,474][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:01:18,134][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:01:18,792][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:19,451][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:20,110][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:01:22,973][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:01:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:01:24,290][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:01:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:01:25,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:01:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:01:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:01:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:01:28,241][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:01:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:01:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:01:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:01:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:01:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:01:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:01:32,852][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:01:33,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:01:34,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:01:35,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:01:35,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:01:35,430][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:01:36,715][__main__][INFO] - Iteration 246 took 52s (10.39% Gen, 87.15% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 50m 56s. Estimated total time: 14h 31m 9s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 34s. [2026-03-25 18:01:36,718][__main__][INFO] - Starting iteration 246. [2026-03-25 18:01:36,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:01:36,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:41,902][__main__][INFO] - Number of regex retries in iteration 246: 0 [2026-03-25 18:01:41,903][__main__][INFO] - agents played in iteration 246 are Alice, Bob [2026-03-25 18:01:42,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:01:42,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:01:42,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:01:42,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:01:43,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:44,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:45,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:46,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:47,094][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:01:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:01:51,706][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:58,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:00,927][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:02,245][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:02:02,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:02:03,562][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:02:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:02:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:02:05,539][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:02:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:02:06,857][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:02:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:02:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:02:08,833][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:02:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:02:10,152][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:02:10,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:02:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:02:12,130][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:02:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:02:13,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:02:14,107][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:02:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:02:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:02:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:02:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:02:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:02:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:02:20,298][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:02:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:02:21,615][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:02:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:02:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:02:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:02:24,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:02:25,310][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:02:25,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:02:26,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:02:27,879][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:02:27,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:02:27,888][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:02:33,593][__main__][INFO] - Iteration 247 took 56s (9.11% Gen, 80.86% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 6m 39s. Estimated total time: 15h 47m 49s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 54s. [2026-03-25 18:02:33,652][__main__][INFO] - Starting iteration 247. [2026-03-25 18:02:33,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:02:33,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:02:38,907][__main__][INFO] - Number of regex retries in iteration 247: 0 [2026-03-25 18:02:38,908][__main__][INFO] - agents played in iteration 247 are Alice, Bob [2026-03-25 18:02:39,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:02:39,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:02:39,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:02:39,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:02:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:02:41,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:02:41,729][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:02:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:43,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:46,989][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:02:47,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:49,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:51,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:53,754][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:54,412][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:02:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:02:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:02:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:59,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:01,004][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:03:01,663][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:02,325][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:04,960][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:03:06,938][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:03:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:03:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:03:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:03:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:03:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:03:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:03:11,553][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:03:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:03:13,273][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:03:13,932][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:03:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:03:15,251][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:03:15,910][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:03:16,569][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:03:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:03:17,891][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:03:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:03:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:03:19,875][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:03:20,535][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:03:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:03:21,857][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:03:22,519][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:03:23,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:03:24,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:03:25,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:03:25,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:03:25,571][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:03:26,979][__main__][INFO] - Iteration 248 took 53s (9.76% Gen, 87.59% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 5m 50s. Estimated total time: 14h 47m 53s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 56s. [2026-03-25 18:03:26,989][__main__][INFO] - Starting iteration 248. [2026-03-25 18:03:27,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:03:27,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:03:32,001][__main__][INFO] - Number of regex retries in iteration 248: 0 [2026-03-25 18:03:32,002][__main__][INFO] - agents played in iteration 248 are Alice, Bob [2026-03-25 18:03:32,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:03:32,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:03:32,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:03:32,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:03:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:03:34,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:03:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:03:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:03:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:03:37,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:03:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:03:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:03:39,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:03:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:03:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:03:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:03:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:03:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:03:43,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:03:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:03:44,689][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:03:45,350][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:03:46,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:03:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:03:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:03:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:48,657][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:49,982][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:51,304][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:53,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:53,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:54,608][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:03:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:55,930][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:56,593][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:59,236][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:59,897][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:00,557][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:02,535][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:03,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:03,854][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:05,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:04:08,796][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:09,459][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:10,120][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:04:12,099][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:04:12,758][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:04:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:04:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:04:14,735][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:04:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:04:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:04:16,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:04:17,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:04:18,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:04:18,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:04:18,828][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:04:20,356][__main__][INFO] - Iteration 249 took 53s (9.32% Gen, 87.81% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 5m 48s. Estimated total time: 14h 48m 44s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 52s, 500 more iterations: 7h 24m 22s. [2026-03-25 18:04:20,410][__main__][INFO] - Starting iteration 249. [2026-03-25 18:04:20,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:04:20,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:04:25,617][__main__][INFO] - Number of regex retries in iteration 249: 0 [2026-03-25 18:04:25,619][__main__][INFO] - agents played in iteration 249 are Alice, Bob [2026-03-25 18:04:26,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:04:26,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:04:26,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:04:26,327][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:04:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:04:27,601][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:04:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:04:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:04:29,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:04:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:04:30,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:04:31,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:04:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:04:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:04:33,526][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:04:34,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:04:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:04:35,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:04:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:04:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:04:37,481][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:04:38,140][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:04:38,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:04:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:04:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:04:40,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:04:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:04:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:04:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:04:43,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:04:44,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:04:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:04:45,392][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:04:46,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:04:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:04:47,368][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:04:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:50,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:50,661][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:52,643][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:53,303][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:53,961][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:55,282][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:55,941][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:57,261][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:00,242][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:05:01,559][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:02,879][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:05,511][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:06,170][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:06,829][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:08,147][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:09,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:05:10,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:05:11,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:11,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:11,478][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:12,913][__main__][INFO] - Iteration 250 took 52s (9.91% Gen, 87.35% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 51m 11s. Estimated total time: 14h 35m 0s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 30s. [2026-03-25 18:05:12,915][__main__][INFO] - Starting iteration 250. [2026-03-25 18:05:12,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:05:12,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:05:17,748][__main__][INFO] - Number of regex retries in iteration 250: 0 [2026-03-25 18:05:17,750][__main__][INFO] - agents played in iteration 250 are Alice, Bob [2026-03-25 18:05:18,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:05:18,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:05:18,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:05:18,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:05:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:05:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:05:20,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:05:21,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:05:21,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:05:22,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:05:23,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:05:23,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:05:24,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:05:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:05:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:05:26,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:05:26,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:05:27,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:05:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:05:28,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:05:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:05:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:05:30,932][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:05:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:05:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:05:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:05:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:05:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:05:34,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:05:35,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:05:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:05:36,858][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:05:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:05:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:05:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:05:39,492][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:05:40,151][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:05:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:05:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:05:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:05:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:05:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:05:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:05:44,762][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:49,379][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:51,711][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:05:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:56,321][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:00,271][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:01,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:06:02,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:06:03,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:03,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:03,834][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:06,848][__main__][INFO] - Iteration 251 took 53s (8.96% Gen, 85.45% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 14m 7s. Estimated total time: 14h 58m 50s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 53s, 500 more iterations: 7h 29m 25s. [2026-03-25 18:06:06,851][__main__][INFO] - Starting iteration 251. [2026-03-25 18:06:06,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:06:06,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:12,050][__main__][INFO] - Number of regex retries in iteration 251: 0 [2026-03-25 18:06:12,051][__main__][INFO] - agents played in iteration 251 are Alice, Bob [2026-03-25 18:06:12,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:06:12,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:06:12,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:06:12,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:06:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:14,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:15,366][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:16,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:16,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:17,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:18,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:06:18,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:06:19,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:06:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:06:20,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:06:21,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:06:21,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:06:22,611][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:06:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:06:23,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:06:24,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:06:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:06:25,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:06:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:06:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:06:27,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:06:28,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:06:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:06:29,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:06:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:06:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:06:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:06:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:06:33,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:06:33,808][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:06:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:06:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:06:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:06:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:06:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:06:37,755][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:06:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:06:39,072][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:06:39,731][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:06:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:06:41,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:06:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:06:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:06:43,025][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:06:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:06:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:06:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:06:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:46,649][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:06:47,966][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:51,258][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:53,234][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:54,551][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:55,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:06:56,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:06:57,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:57,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:57,957][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:59,477][__main__][INFO] - Iteration 252 took 52s (9.88% Gen, 87.23% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 51m 29s. Estimated total time: 14h 37m 4s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 32s. [2026-03-25 18:06:59,480][__main__][INFO] - Starting iteration 252. [2026-03-25 18:06:59,484][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:06:59,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:06,939][__main__][INFO] - Number of regex retries in iteration 252: 0 [2026-03-25 18:07:06,940][__main__][INFO] - agents played in iteration 252 are Alice, Bob [2026-03-25 18:07:07,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:07:07,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:07:07,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:07:07,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:07:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:07:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:07:09,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:07:10,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:07:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:07:11,544][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:07:12,202][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:07:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:14,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:14,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:07:15,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:16,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:16,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:18,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:18,787][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:21,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:22,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:24,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:07:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:07:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:07:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:07:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:07:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:07:28,671][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:07:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:07:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:07:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:07:31,304][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:07:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:07:32,620][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:07:33,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:07:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:07:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:07:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:07:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:07:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:07:37,230][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:07:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:07:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:07:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:07:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:07:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:07:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:07:42,166][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:07:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:07:43,482][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:07:44,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:07:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:07:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:07:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:07:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:07:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:07:48,090][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:50,064][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:50,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:07:51,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:07:53,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:53,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:53,068][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:54,548][__main__][INFO] - Iteration 253 took 55s (13.53% Gen, 83.77% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 31m 15s. Estimated total time: 15h 17m 45s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 52s. [2026-03-25 18:07:54,551][__main__][INFO] - Starting iteration 253. [2026-03-25 18:07:54,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:07:54,556][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:59,610][__main__][INFO] - Number of regex retries in iteration 253: 0 [2026-03-25 18:07:59,611][__main__][INFO] - agents played in iteration 253 are Alice, Bob [2026-03-25 18:08:00,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:00,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:00,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:08:00,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:08:00,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:02,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:05,502][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:06,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:08:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:08:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:08:09,452][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:08:10,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:08:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:08:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:08:12,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:08:12,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:08:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:08:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:08:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:08:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:08:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:08:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:08:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:18,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:18,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:19,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:21,307][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:08:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:22,628][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:26,581][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:27,239][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:27,898][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:28,556][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:08:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:08:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:08:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:08:31,190][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:08:31,848][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:08:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:08:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:08:34,077][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:08:34,735][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:08:35,393][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:08:36,051][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:08:36,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:08:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:08:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:08:38,685][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:08:39,343][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:08:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:08:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:08:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:08:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:08:42,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:08:43,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:08:44,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:08:45,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:08:45,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:08:45,226][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:08:46,693][__main__][INFO] - Iteration 254 took 52s (9.70% Gen, 87.48% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 41m 37s. Estimated total time: 14h 28m 59s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 29s. [2026-03-25 18:08:46,696][__main__][INFO] - Starting iteration 254. [2026-03-25 18:08:46,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:08:46,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:08:51,565][__main__][INFO] - Number of regex retries in iteration 254: 0 [2026-03-25 18:08:51,567][__main__][INFO] - agents played in iteration 254 are Alice, Bob [2026-03-25 18:08:52,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:52,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:52,219][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:08:52,220][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:08:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:54,836][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:55,493][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:58,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:09:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:00,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:03,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:04,045][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:09:04,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:09:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:09:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:09:06,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:09:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:09:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:09:08,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:09:09,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:09:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:09:10,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:09:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:09:11,941][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:09:12,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:09:13,256][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:09:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:09:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:09:15,228][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:09:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:09:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:09:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:09:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:09:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:09:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:09:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:09:20,490][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:21,806][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:23,122][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:23,781][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:25,343][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:26,001][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:09:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:29,294][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:29,952][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:09:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:09:33,243][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:09:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:09:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:09:35,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:09:35,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:09:37,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:09:37,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:09:40,999][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:09:42,323][__main__][INFO] - Iteration 255 took 55s (8.75% Gen, 88.87% Train). Generation: 4s, Training: 49s. Estimated remaining time: 11h 38m 46s. Estimated total time: 15h 27m 4s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 42s, 500 more iterations: 7h 43m 32s. [2026-03-25 18:09:42,326][__main__][INFO] - Starting iteration 255. [2026-03-25 18:09:42,330][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:09:42,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:47,270][__main__][INFO] - Number of regex retries in iteration 255: 0 [2026-03-25 18:09:47,271][__main__][INFO] - agents played in iteration 255 are Alice, Bob [2026-03-25 18:09:47,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:47,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:48,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:09:48,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:09:48,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:49,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:50,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:51,261][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:09:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:09:53,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:09:53,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:09:54,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:09:55,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:09:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:58,503][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:59,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:59,819][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:04,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:07,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:07,795][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:10:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:10:09,111][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:10:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:10:10,426][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:10:11,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:10:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:10:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:10:13,058][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:10:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:10:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:10:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:10:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:10:16,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:10:17,007][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:10:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:10:18,324][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:10:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:10:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:10:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:10:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:10:21,927][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:10:22,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:10:23,243][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:10:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:10:24,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:10:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:10:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:10:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:10:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:10:27,851][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:29,825][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:31,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:10:32,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:10:33,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:33,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:33,195][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:34,456][__main__][INFO] - Iteration 256 took 52s (9.48% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 39m 37s. Estimated total time: 14h 28m 48s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 24s. [2026-03-25 18:10:34,459][__main__][INFO] - Starting iteration 256. [2026-03-25 18:10:34,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:10:34,464][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:39,261][__main__][INFO] - Number of regex retries in iteration 256: 0 [2026-03-25 18:10:39,262][__main__][INFO] - agents played in iteration 256 are Alice, Bob [2026-03-25 18:10:39,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:10:39,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:10:39,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:10:39,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:10:40,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:10:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:10:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:10:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:10:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:44,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:45,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:10:47,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:49,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:10:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:10:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:53,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:54,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:56,977][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:58,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:00,927][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:11:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:04,873][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:06,189][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:06,847][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:11:08,163][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:11:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:11:09,479][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:11:10,137][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:11:10,796][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:11:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:11:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:11:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:11:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:11:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:11:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:11:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:11:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:11:16,982][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:11:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:11:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:11:18,955][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:11:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:11:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:11:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:11:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:11:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:11:22,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:11:23,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:11:24,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:11:24,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:11:24,818][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:11:26,281][__main__][INFO] - Iteration 257 took 51s (9.26% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 33m 38s. Estimated total time: 14h 23m 40s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 22s, 500 more iterations: 7h 11m 50s. [2026-03-25 18:11:26,284][__main__][INFO] - Starting iteration 257. [2026-03-25 18:11:26,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:11:26,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:31,201][__main__][INFO] - Number of regex retries in iteration 257: 0 [2026-03-25 18:11:31,202][__main__][INFO] - agents played in iteration 257 are Alice, Bob [2026-03-25 18:11:31,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:11:31,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:11:31,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:11:31,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:11:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:33,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:35,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:36,382][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:11:39,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:11:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:11:40,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:11:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:11:42,304][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:42,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:11:44,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:11:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:11:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:11:46,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:11:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:11:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:11:48,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:11:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:11:50,208][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:11:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:11:51,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:52,843][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:11:53,505][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:54,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:54,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:55,483][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:58,776][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:01,411][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:02,729][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:03,387][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:04,306][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:05,621][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:12:06,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:12:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:12:08,915][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:12:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:12:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:12:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:12:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:12:12,203][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:12:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:12:13,519][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:12:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:12:14,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:12:15,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:12:16,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:12:16,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:12:16,797][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:12:18,006][__main__][INFO] - Iteration 258 took 51s (9.50% Gen, 88.15% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 31m 6s. Estimated total time: 14h 22m 0s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 12s, 500 more iterations: 7h 11m 0s. [2026-03-25 18:12:18,009][__main__][INFO] - Starting iteration 258. [2026-03-25 18:12:18,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:12:18,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:12:22,960][__main__][INFO] - Number of regex retries in iteration 258: 0 [2026-03-25 18:12:22,961][__main__][INFO] - agents played in iteration 258 are Alice, Bob [2026-03-25 18:12:23,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:12:23,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:12:23,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:12:23,655][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:12:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:12:24,993][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:12:25,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:12:26,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:12:26,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:12:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:12:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:12:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:12:29,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:12:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:12:30,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:12:31,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:32,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:34,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:38,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:41,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:42,101][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:12:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:12:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:12:44,733][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:12:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:12:46,048][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:12:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:12:47,363][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:12:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:12:48,678][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:12:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:12:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:12:50,651][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:12:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:55,255][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:56,144][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:56,802][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:12:58,775][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:01,407][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:02,065][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:04,697][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:05,355][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:06,013][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:06,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:13:07,368][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 18:13:08,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:13:08,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:13:08,761][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:13:10,070][__main__][INFO] - Iteration 259 took 52s (9.50% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 35m 53s. Estimated total time: 14h 27m 39s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 49s. [2026-03-25 18:13:10,073][__main__][INFO] - Starting iteration 259. [2026-03-25 18:13:10,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:13:10,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:13:15,242][__main__][INFO] - Number of regex retries in iteration 259: 0 [2026-03-25 18:13:15,250][__main__][INFO] - agents played in iteration 259 are Alice, Bob [2026-03-25 18:13:15,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:13:15,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:13:15,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:13:15,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:13:16,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:13:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:13:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:13:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:19,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:21,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:13:22,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:13:22,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:13:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:13:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:13:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:13:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:13:26,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:13:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:13:27,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:13:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:13:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:13:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:31,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:34,841][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:35,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:36,821][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:13:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:38,798][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:39,457][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:40,116][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:40,776][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:41,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:42,755][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:43,414][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:44,074][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:45,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:46,054][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:13:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:13:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:13:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:13:49,657][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:13:50,316][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:13:50,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:13:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:52,295][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:54,933][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:55,593][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:56,255][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:56,916][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:58,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:13:59,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:14:00,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:00,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:00,999][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:02,362][__main__][INFO] - Iteration 260 took 52s (9.89% Gen, 87.50% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 38m 47s. Estimated total time: 14h 31m 25s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 42s. [2026-03-25 18:14:02,365][__main__][INFO] - Starting iteration 260. [2026-03-25 18:14:02,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:14:02,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:07,215][__main__][INFO] - Number of regex retries in iteration 260: 0 [2026-03-25 18:14:07,217][__main__][INFO] - agents played in iteration 260 are Alice, Bob [2026-03-25 18:14:07,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:14:07,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:14:07,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:14:07,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:14:08,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:14:09,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:14:09,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:14:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:14:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:14:11,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:14:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:14:13,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:14:13,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:15,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:14:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:16,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:14:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:14:17,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:14:18,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:14:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:14:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:14:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:14:21,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:14:21,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:14:22,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:14:23,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:14:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:14:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:14:25,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:14:25,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:14:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:14:26,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:14:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:14:28,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:28,955][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:14:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:30,930][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:31,587][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:32,246][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:32,903][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:33,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:34,221][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:35,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:14:36,195][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:14:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:14:37,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:14:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:14:38,828][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:39,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:40,475][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:41,792][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:42,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:14:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:44,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:47,716][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:48,377][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:14:51,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:14:51,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:14:53,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:53,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:53,179][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:54,714][__main__][INFO] - Iteration 261 took 52s (9.26% Gen, 87.81% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 38m 56s. Estimated total time: 14h 32m 26s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 13s. [2026-03-25 18:14:54,717][__main__][INFO] - Starting iteration 261. [2026-03-25 18:14:54,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:14:54,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:59,869][__main__][INFO] - Number of regex retries in iteration 261: 0 [2026-03-25 18:14:59,870][__main__][INFO] - agents played in iteration 261 are Alice, Bob [2026-03-25 18:15:00,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:00,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:00,549][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:15:00,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:15:01,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:04,457][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:05,115][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:06,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:07,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:15:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:15:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:11,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:15:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:15:13,011][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:15:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:15:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:15:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:15:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:15:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:15:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:15:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:15:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:15:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:15:19,590][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:15:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:15:20,905][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:15:21,563][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:15:22,220][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:15:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:15:23,536][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:15:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:15:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:15:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:15:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:15:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:15:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:15:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:28,797][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:29,454][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:30,112][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:15:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:15:36,285][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:15:36,943][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:15:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:15:38,259][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:15:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:15:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:15:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:15:40,890][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:15:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:15:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:15:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:43,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:15:44,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:15:45,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:45,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:45,350][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:46,942][__main__][INFO] - Iteration 262 took 52s (9.86% Gen, 87.09% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 36m 0s. Estimated total time: 14h 30m 23s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 11s. [2026-03-25 18:15:46,952][__main__][INFO] - Starting iteration 262. [2026-03-25 18:15:46,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:15:46,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:15:51,971][__main__][INFO] - Number of regex retries in iteration 262: 0 [2026-03-25 18:15:51,972][__main__][INFO] - agents played in iteration 262 are Alice, Bob [2026-03-25 18:15:52,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:52,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:52,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:15:52,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:15:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:55,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:55,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:56,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:57,153][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:57,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:16:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:04,390][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:06,364][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:07,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:16:07,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:16:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:16:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:16:09,653][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:16:10,311][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:16:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:16:11,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:16:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:16:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:16:13,600][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:16:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:16:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:16:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:16:16,232][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:16:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:16:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:16:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:16:18,866][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:16:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:16:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:16:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:16:21,498][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:16:22,156][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:16:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:16:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:16:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:16:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:16:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:16:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:16:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:16:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:32,267][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:16:32,924][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:16:33,582][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:16:34,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:16:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:16:35,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:16:36,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:16:37,694][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:16:37,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:16:37,704][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:16:39,301][__main__][INFO] - Iteration 263 took 52s (9.57% Gen, 87.37% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 37m 7s. Estimated total time: 14h 32m 22s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 11s. [2026-03-25 18:16:39,304][__main__][INFO] - Starting iteration 263. [2026-03-25 18:16:39,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:16:39,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:44,349][__main__][INFO] - Number of regex retries in iteration 263: 0 [2026-03-25 18:16:44,350][__main__][INFO] - agents played in iteration 263 are Alice, Bob [2026-03-25 18:16:45,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:16:45,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:16:45,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:16:45,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:16:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:49,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:50,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:51,012][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:16:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:57,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:58,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:58,927][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:00,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:02,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:04,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:05,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:06,166][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:17:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:17:08,140][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:17:08,799][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:17:09,457][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:17:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:17:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:17:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:17:12,088][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:17:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:17:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:17:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:17:14,719][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:17:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:17:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:17:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:17:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:17:18,357][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:17:19,015][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:17:19,673][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:17:20,331][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:17:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:17:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:17:22,305][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:17:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:17:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:17:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:17:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:25,593][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:28,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:17:28,993][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:17:30,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:30,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:30,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:17:31,985][__main__][INFO] - Iteration 264 took 52s (9.57% Gen, 86.98% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 41m 50s. Estimated total time: 14h 37m 58s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 47s, 500 more iterations: 7h 18m 59s. [2026-03-25 18:17:31,988][__main__][INFO] - Starting iteration 264. [2026-03-25 18:17:31,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:17:31,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:17:36,925][__main__][INFO] - Number of regex retries in iteration 264: 0 [2026-03-25 18:17:36,926][__main__][INFO] - agents played in iteration 264 are Alice, Bob [2026-03-25 18:17:37,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:17:37,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:17:37,641][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:17:37,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:17:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:17:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:17:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:17:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:17:40,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:17:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:17:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:17:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:17:43,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:17:44,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:17:44,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:17:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:46,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:47,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:48,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:48,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:49,436][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:50,093][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:54,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:54,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:55,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:58,643][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:17:59,301][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:00,616][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:02,591][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:18:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:18:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:18:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:18:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:18:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:18:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:18:12,717][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:18:13,375][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:18:14,033][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:18:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:18:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:18:16,008][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:18:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:18:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:18:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:18:18,642][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:18:19,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:18:19,958][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:18:20,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:18:21,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:18:22,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:18:22,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:18:22,536][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:24,598][__main__][INFO] - Iteration 265 took 52s (9.32% Gen, 86.70% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 39m 47s. Estimated total time: 14h 36m 48s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 40s, 500 more iterations: 7h 18m 24s. [2026-03-25 18:18:24,601][__main__][INFO] - Starting iteration 265. [2026-03-25 18:18:24,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:18:24,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:29,692][__main__][INFO] - Number of regex retries in iteration 265: 0 [2026-03-25 18:18:29,694][__main__][INFO] - agents played in iteration 265 are Alice, Bob [2026-03-25 18:18:30,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:30,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:30,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:18:30,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:18:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:18:31,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:18:32,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:18:32,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:18:33,639][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:18:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:18:34,957][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:18:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:18:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:18:36,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:18:37,593][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:18:38,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:18:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:18:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:18:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:18:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:18:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:18:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:18:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:18:43,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:18:44,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:18:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:18:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:18:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:18:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:18:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:18:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:48,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:49,447][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:50,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:50,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:51,421][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:18:52,078][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:53,394][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:54,051][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:54,709][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:57,344][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:58,002][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:01,950][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:02,959][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:03,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:04,276][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:04,935][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:19:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:06,251][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:07,567][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:09,543][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:10,201][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:11,517][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:19:12,174][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:19:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:19:13,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:19:14,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:19:15,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:19:15,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:19:15,305][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:19:17,643][__main__][INFO] - Iteration 266 took 53s (9.59% Gen, 85.99% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 46m 5s. Estimated total time: 14h 43m 58s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 23s, 500 more iterations: 7h 21m 59s. [2026-03-25 18:19:17,646][__main__][INFO] - Starting iteration 266. [2026-03-25 18:19:17,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:19:17,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:19:22,649][__main__][INFO] - Number of regex retries in iteration 266: 0 [2026-03-25 18:19:22,650][__main__][INFO] - agents played in iteration 266 are Alice, Bob [2026-03-25 18:19:23,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:19:23,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:19:23,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:19:23,452][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:19:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:24,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:25,427][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:26,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:28,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:28,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:30,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:19:31,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:19:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:19:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:19:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:19:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:19:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:19:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:19:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:19:36,628][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:19:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:19:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:19:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:19:39,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:19:39,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:19:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:19:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:19:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:19:42,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:19:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:19:43,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:19:44,531][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:19:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:47,173][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:19:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:19:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:19:49,149][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:19:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:19:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:19:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:19:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:19:52,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:19:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:56,040][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:19:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:01,963][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:02,620][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:03,939][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:05,918][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:06,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:20:07,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:20:08,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:08,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:08,453][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:09,911][__main__][INFO] - Iteration 267 took 52s (9.57% Gen, 87.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 32m 17s. Estimated total time: 14h 31m 2s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 31s. [2026-03-25 18:20:09,913][__main__][INFO] - Starting iteration 267. [2026-03-25 18:20:09,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:20:09,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:20:11,382][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2026-03-25 18:20:15,265][__main__][INFO] - Number of regex retries in iteration 267: 1 [2026-03-25 18:20:15,266][__main__][INFO] - agents played in iteration 267 are Alice, Bob [2026-03-25 18:20:15,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:20:16,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:20:16,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:20:16,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 18:20:16,738][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:20:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:20:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:20:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:20:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:20:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:20:20,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:20:21,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:20:21,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:20:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:20:23,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:20:23,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:25,938][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:27,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:29,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:20:29,899][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 18:20:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:20:31,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:20:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:20:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:20:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:20:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:20:34,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:20:35,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:20:35,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:20:36,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:20:37,150][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:20:37,808][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:20:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:20:39,127][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:20:39,786][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:42,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
18:20:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:20:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:20:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:20:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:20:46,379][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:20:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:20:47,697][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:20:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:20:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:20:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:20:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:20:51,233][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:20:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:20:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:53,210][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:56,509][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:57,168][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 18:20:57,826][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:59,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:20:59,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:21:01,117][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:01,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:01,122][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:02,617][__main__][INFO] - Iteration 268 took 52s (10.15% Gen, 87.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 38m 43s. Estimated total time: 14h 38m 22s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 11s. [2026-03-25 18:21:02,620][__main__][INFO] - Starting iteration 268. [2026-03-25 18:21:02,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:21:02,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:07,656][__main__][INFO] - Number of regex retries in iteration 268: 0 [2026-03-25 18:21:07,658][__main__][INFO] - agents played in iteration 268 are Alice, Bob [2026-03-25 18:21:08,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:21:08,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:21:08,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:21:08,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:21:09,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:11,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:15,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:21:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:16,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:21:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:21:18,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:21:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:21:19,518][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:21:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:21:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:21:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:22,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:23,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:21:28,087][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:21:28,748][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:21:29,408][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:21:30,070][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:21:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:21:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:21:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:21:32,705][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:21:33,363][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:21:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:21:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:21:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:36,653][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:21:37,311][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:21:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:21:38,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:21:39,286][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:21:39,943][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:21:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:21:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:21:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:21:42,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:21:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:21:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:21:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:21:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:21:46,198][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:21:46,856][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:21:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:21:48,172][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:21:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:21:49,487][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:21:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:21:50,803][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:21:51,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:21:52,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:21:53,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:53,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:53,473][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:54,890][__main__][INFO] - Iteration 269 took 52s (9.63% Gen, 87.66% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 30m 36s. Estimated total time: 14h 31m 7s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 33s. [2026-03-25 18:21:54,893][__main__][INFO] - Starting iteration 269. [2026-03-25 18:21:54,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:21:54,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:59,770][__main__][INFO] - Number of regex retries in iteration 269: 0 [2026-03-25 18:21:59,772][__main__][INFO] - agents played in iteration 269 are Alice, Bob [2026-03-25 18:22:00,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:00,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:00,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:22:00,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:22:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:06,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:07,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:22:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:22:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:22:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:09,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:10,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:10,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:12,283][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:13,599][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:14,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:15,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:19,519][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:20,833][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:21,491][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:22:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:24,126][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:26,763][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:22:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:22:28,734][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:30,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:31,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:22:32,020][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:22:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:22:33,660][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:22:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:22:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:22:35,635][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:22:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:22:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:22:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:22:38,268][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:22:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:22:39,585][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:22:40,243][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:22:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:22:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:22:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:22:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:22:43,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:22:44,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:22:45,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:22:45,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:22:45,541][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:22:46,902][__main__][INFO] - Iteration 270 took 52s (9.37% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 25m 24s. Estimated total time: 14h 26m 47s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 23s. [2026-03-25 18:22:46,904][__main__][INFO] - Starting iteration 270. [2026-03-25 18:22:46,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:22:46,909][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:22:52,280][__main__][INFO] - Number of regex retries in iteration 270: 0 [2026-03-25 18:22:52,281][__main__][INFO] - agents played in iteration 270 are Alice, Bob [2026-03-25 18:22:52,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:52,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:52,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:22:52,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:22:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:54,229][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:56,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:58,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:00,147][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:23:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:04,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:23:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:23:08,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:23:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:23:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:23:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:23:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:23:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:23:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:23:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:13,963][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:23:14,621][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:17,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:17,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:19,887][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:21,860][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:23,176][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:23,833][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:24,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:26,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:23:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:23:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:23:28,760][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:23:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:23:30,076][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:23:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:23:31,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:23:32,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:23:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:23:33,365][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:23:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:23:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:23:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:23:35,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:23:36,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:23:38,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:23:38,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:23:38,089][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:23:39,422][__main__][INFO] - Iteration 271 took 52s (10.23% Gen, 87.23% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 33m 0s. Estimated total time: 14h 35m 15s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 31s, 500 more iterations: 7h 17m 37s. [2026-03-25 18:23:39,425][__main__][INFO] - Starting iteration 271. [2026-03-25 18:23:39,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:23:39,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:23:44,503][__main__][INFO] - Number of regex retries in iteration 271: 0 [2026-03-25 18:23:44,505][__main__][INFO] - agents played in iteration 271 are Alice, Bob [2026-03-25 18:23:45,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:23:45,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:23:45,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:23:45,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:23:45,923][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:23:46,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:23:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:23:47,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:23:48,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:23:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:23:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:23:50,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:23:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:23:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:23:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:53,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:55,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:57,727][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:01,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:01,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:02,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:03,644][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:24:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:24:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:24:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:24:06,273][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:24:06,930][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:24:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:24:08,245][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:24:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:24:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:24:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:24:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:24:11,532][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:24:12,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:24:14,819][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:24:15,476][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:24:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:24:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:24:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:19,749][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:24:20,406][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:21,063][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:23,038][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:23,695][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:25,668][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:26,327][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:28,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:24:29,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:24:30,372][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:24:30,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:24:30,376][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:24:31,848][__main__][INFO] - Iteration 272 took 52s (9.68% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 30m 33s. Estimated total time: 14h 33m 41s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 50s. [2026-03-25 18:24:31,851][__main__][INFO] - Starting iteration 272. [2026-03-25 18:24:31,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:24:31,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:24:40,766][__main__][INFO] - Number of regex retries in iteration 272: 0 [2026-03-25 18:24:40,768][__main__][INFO] - agents played in iteration 272 are Alice, Bob [2026-03-25 18:24:41,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:24:41,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:24:41,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:24:41,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:24:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:24:42,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:24:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:24:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:24:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:24:45,392][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:24:46,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:24:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:24:47,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:24:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:24:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:24:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:24:49,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:24:50,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:24:51,306][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:24:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:24:52,621][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:24:53,277][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:24:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:24:54,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:24:55,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:24:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:57,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:58,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:02,482][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:25:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:03,797][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:25:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:25:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:25:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:25:08,397][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:25:09,054][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:25:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:25:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:13,989][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:14,646][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:15,963][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:25:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:25:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:25:17,940][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:19,256][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:19,917][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:25:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:25:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:25:21,889][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:25:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:25:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:25:23,862][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:25:24,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:25:25,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:25:26,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:26,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:26,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:27,882][__main__][INFO] - Iteration 273 took 56s (15.91% Gen, 81.29% Train). Generation: 8s, Training: 45s. Estimated remaining time: 11h 29m 44s. Estimated total time: 15h 33m 48s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 54s. [2026-03-25 18:25:27,886][__main__][INFO] - Starting iteration 273. [2026-03-25 18:25:27,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:25:27,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:34,732][__main__][INFO] - Number of regex retries in iteration 273: 0 [2026-03-25 18:25:34,733][__main__][INFO] - agents played in iteration 273 are Alice, Bob [2026-03-25 18:25:35,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:25:35,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:25:35,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:25:35,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:25:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:25:36,698][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:25:37,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:25:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:25:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:25:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:25:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:25:40,649][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:25:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:25:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:25:42,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:25:43,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:25:43,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:25:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:25:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:25:45,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:25:46,586][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:25:47,245][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:25:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:25:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:25:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:25:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:25:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:25:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:25:51,861][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:25:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:25:53,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:25:53,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:55,808][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:56,470][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:25:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:57,787][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:58,446][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:00,420][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:01,736][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:02,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:03,053][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:03,711][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:06,342][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:26:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:26:08,584][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:26:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:26:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:26:10,559][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:26:17,799][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:26:18,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:26:19,212][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:26:20,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:26:20,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:26:20,371][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:26:21,953][__main__][INFO] - Iteration 274 took 54s (12.66% Gen, 84.41% Train). Generation: 6s, Training: 45s. Estimated remaining time: 10h 56m 7s. Estimated total time: 15h 1m 5s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 32s. [2026-03-25 18:26:21,955][__main__][INFO] - Starting iteration 274. [2026-03-25 18:26:21,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:26:21,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:26:26,984][__main__][INFO] - Number of regex retries in iteration 274: 0 [2026-03-25 18:26:26,985][__main__][INFO] - agents played in iteration 274 are Alice, Bob [2026-03-25 18:26:27,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:26:27,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:26:27,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:26:27,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:26:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:30,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:32,223][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:32,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:33,541][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:34,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:26:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:37,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:40,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:26:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:26:41,444][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:26:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:26:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:26:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:26:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:26:44,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:26:45,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:26:46,052][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:26:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:26:47,369][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:26:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:26:48,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:26:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:26:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:26:50,659][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:26:51,317][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:26:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:54,607][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:00,135][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:01,452][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:27:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:04,083][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:05,402][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:06,059][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:08,695][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:10,012][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:10,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:27:11,426][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:27:12,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:12,599][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:12,600][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:14,034][__main__][INFO] - Iteration 275 took 52s (9.65% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 22m 7s. Estimated total time: 14h 27m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 58s. [2026-03-25 18:27:14,037][__main__][INFO] - Starting iteration 275. [2026-03-25 18:27:14,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:27:14,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:27:18,897][__main__][INFO] - Number of regex retries in iteration 275: 0 [2026-03-25 18:27:18,898][__main__][INFO] - agents played in iteration 275 are Alice, Bob [2026-03-25 18:27:19,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:27:19,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:27:19,564][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:27:19,565][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:27:20,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:27:20,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:27:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:27:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:27:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:27:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:27:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:27:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:27:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:27:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:27:26,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:27:27,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:27:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:27:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:27:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:27:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:27:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:27:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:27:32,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:36,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:39,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:40,609][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:27:41,268][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:41,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:42,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:43,243][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:27:43,901][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:27:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:27:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:27:45,875][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:27:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:27:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:27:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:27:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:27:49,166][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:27:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:27:50,482][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:27:51,139][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:27:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:56,047][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:57,365][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:00,655][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:01,971][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:02,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:28:03,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:28:04,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:04,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:04,644][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:07,221][__main__][INFO] - Iteration 276 took 53s (9.13% Gen, 86.02% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 39m 38s. Estimated total time: 14h 46m 21s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 38s, 500 more iterations: 7h 23m 10s. [2026-03-25 18:28:07,224][__main__][INFO] - Starting iteration 276. [2026-03-25 18:28:07,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:28:07,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:12,111][__main__][INFO] - Number of regex retries in iteration 276: 0 [2026-03-25 18:28:12,113][__main__][INFO] - agents played in iteration 276 are Alice, Bob [2026-03-25 18:28:12,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:12,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:12,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:28:12,774][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:28:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:28:14,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:28:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:28:16,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:28:16,675][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:28:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:28:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:28:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:28:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:28:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:28:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:28:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:28:21,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:28:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:28:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:28:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:28:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:28:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:28:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:28:26,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:28:27,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:28:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:28:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:28:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:28:29,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:28:30,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:28:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:28:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:28:32,456][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:28:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:28:33,771][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:28:34,428][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:28:35,086][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:28:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:28:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:37,716][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:39,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:40,347][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:42,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:44,294][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:45,181][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:28:47,814][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:28:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:28:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:28:49,786][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:28:50,444][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:28:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:28:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:28:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:28:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:53,732][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:54,390][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:55,047][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:55,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:28:56,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:28:57,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:57,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:57,726][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:59,299][__main__][INFO] - Iteration 277 took 52s (9.37% Gen, 87.60% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 20m 15s. Estimated total time: 14h 27m 50s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 55s. [2026-03-25 18:28:59,303][__main__][INFO] - Starting iteration 277. [2026-03-25 18:28:59,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:28:59,308][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:04,177][__main__][INFO] - Number of regex retries in iteration 277: 0 [2026-03-25 18:29:04,178][__main__][INFO] - agents played in iteration 277 are Alice, Bob [2026-03-25 18:29:04,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:04,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:04,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:29:04,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:29:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:10,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:12,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:29:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:29:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:29:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:29:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:29:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:29:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:29:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:29:17,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:29:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:29:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:29:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:29:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:29:20,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:29:21,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:29:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:29:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:29:23,190][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:29:23,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:29:24,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:29:25,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:29:25,821][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:29:26,481][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:29:27,139][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:29:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:29:28,454][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:29:29,111][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:29:29,769][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:29:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:29:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:29:31,741][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:29:32,398][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:29:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:29:33,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:29:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:29:35,028][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:29:35,686][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:29:36,343][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:29:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:29:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:29:38,611][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:29:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:29:39,926][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:41,898][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:43,870][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:46,501][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:47,159][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:47,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:29:48,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:29:49,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:49,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:49,634][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:29:51,084][__main__][INFO] - Iteration 278 took 51s (9.41% Gen, 87.79% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 14m 32s. Estimated total time: 14h 22m 59s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 17s, 500 more iterations: 7h 11m 29s. [2026-03-25 18:29:51,087][__main__][INFO] - Starting iteration 278. [2026-03-25 18:29:51,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:29:51,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:56,504][__main__][INFO] - Number of regex retries in iteration 278: 0 [2026-03-25 18:29:56,506][__main__][INFO] - agents played in iteration 278 are Alice, Bob [2026-03-25 18:29:57,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:57,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:57,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:29:57,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:29:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:59,917][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:30:00,574][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:30:01,231][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:30:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:30:02,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:30:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:30:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:30:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:30:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:05,834][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:06,491][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:30:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:30:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:30:08,463][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:30:09,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:30:09,777][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:30:10,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:30:11,091][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:30:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:30:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:30:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:30:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:30:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:30:15,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:30:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:30:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:30:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:30:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:30:18,324][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:30:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:30:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:30:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:30:20,953][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:30:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:30:22,267][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:30:22,925][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:30:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:30:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:30:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:30:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:30:26,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:30:26,873][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:30:27,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:30:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:30:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:30:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:30:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:30:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:30:31,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:30:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:30:33,075][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:30:33,733][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:30:34,390][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:30:35,048][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:30:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:30:36,363][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:30:37,021][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:30:37,679][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:30:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:30:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:30:39,652][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:30:40,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:30:41,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:30:49,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:30:49,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:30:49,672][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:30:51,238][__main__][INFO] - Iteration 279 took 1m 0s (9.00% Gen, 88.39% Train). Generation: 5s, Training: 53s. Estimated remaining time: 12h 33m 2s. Estimated total time: 16h 42m 29s. Time estimates for 10 more iterations: 10m 1s, 100 more iterations: 1h 40m 14s, 500 more iterations: 8h 21m 14s. [2026-03-25 18:30:51,242][__main__][INFO] - Starting iteration 279. [2026-03-25 18:30:51,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:30:51,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:56,453][__main__][INFO] - Number of regex retries in iteration 279: 0 [2026-03-25 18:30:56,455][__main__][INFO] - agents played in iteration 279 are Alice, Bob [2026-03-25 18:30:57,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:30:57,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:30:57,158][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:30:57,159][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:30:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:59,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:59,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:01,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:02,356][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:04,327][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:31:04,984][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:05,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:06,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:08,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:08,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:31:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:31:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:31:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:31:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:31:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:31:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:31:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:31:17,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:31:18,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:31:18,793][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:31:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:31:20,108][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:31:20,765][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:31:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:31:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:31:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:31:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:31:24,055][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:31:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:31:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:31:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:31:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:31:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:31:27,998][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:31:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:31:29,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:31:30,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:31:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:31:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:31:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:31:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:31:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:31:34,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:31:34,811][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:31:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:31:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:31:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:31:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:31:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:31:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:31:39,413][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:31:40,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:31:40,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:31:42,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:31:42,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:31:42,106][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:43,777][__main__][INFO] - Iteration 280 took 52s (9.91% Gen, 86.90% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 25m 10s. Estimated total time: 14h 35m 30s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 33s, 500 more iterations: 7h 17m 45s. [2026-03-25 18:31:43,779][__main__][INFO] - Starting iteration 280. [2026-03-25 18:31:43,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:31:43,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:31:48,683][__main__][INFO] - Number of regex retries in iteration 280: 0 [2026-03-25 18:31:48,684][__main__][INFO] - agents played in iteration 280 are Alice, Bob [2026-03-25 18:31:49,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:31:49,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:31:49,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:31:49,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:31:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:31:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:31:51,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:31:51,979][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:52,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:53,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:55,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:55,929][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:31:57,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:57,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:58,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:59,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:00,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:04,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:05,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:05,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:07,137][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:07,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:08,456][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:09,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:10,432][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:32:11,090][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:12,407][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:32:15,037][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:32:15,695][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:32:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:32:17,011][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:32:17,669][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:32:18,328][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:32:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:32:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:32:20,301][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:32:20,958][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:32:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:32:22,571][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:32:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:32:23,887][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:32:24,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:32:25,202][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:32:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:32:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:32:27,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:32:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:32:28,495][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:32:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:32:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:32:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:32:31,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:32:31,786][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:32:32,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:32:33,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:32:34,295][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:32:34,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:32:34,300][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:32:35,768][__main__][INFO] - Iteration 281 took 51s (9.42% Gen, 87.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 15m 15s. Estimated total time: 14h 26m 27s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 13s. [2026-03-25 18:32:35,772][__main__][INFO] - Starting iteration 281. [2026-03-25 18:32:35,776][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:32:35,777][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:32:40,773][__main__][INFO] - Number of regex retries in iteration 281: 0 [2026-03-25 18:32:40,775][__main__][INFO] - agents played in iteration 281 are Alice, Bob [2026-03-25 18:32:41,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:32:41,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:32:41,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:32:41,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:32:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:32:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:32:43,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:32:43,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:32:44,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:32:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:32:45,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:32:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:32:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:32:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:32:48,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:32:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:32:49,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:32:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:32:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:32:51,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:55,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:57,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:59,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:00,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:02,424][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:33:03,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:03,740][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:05,055][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:06,372][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:07,030][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:08,347][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:09,664][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:10,980][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:12,296][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:15,170][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:33:16,485][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:17,801][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:33:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:33:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:33:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:33:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:33:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:33:21,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:33:22,407][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:33:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:33:23,723][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:33:24,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:33:25,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:33:26,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:33:26,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:33:26,265][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:33:27,540][__main__][INFO] - Iteration 282 took 51s (9.66% Gen, 87.88% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 10m 42s. Estimated total time: 14h 22m 46s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 23s. [2026-03-25 18:33:27,543][__main__][INFO] - Starting iteration 282. [2026-03-25 18:33:27,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:33:27,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:32,629][__main__][INFO] - Number of regex retries in iteration 282: 0 [2026-03-25 18:33:32,631][__main__][INFO] - agents played in iteration 282 are Alice, Bob [2026-03-25 18:33:33,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:33:33,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:33:33,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:33:33,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:33:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:33:34,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:33:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:33:36,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:33:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:33:37,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:33:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:33:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:33:39,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:33:40,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:33:40,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:33:41,348][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:33:42,009][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:33:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:33:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:33:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:33:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:33:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:33:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:33:46,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:33:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:33:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:33:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:33:49,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:33:49,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:33:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:33:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:33:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:53,202][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:54,518][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:33:55,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:57,152][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:02,417][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:03,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:05,049][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:07,319][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:07,977][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:34:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:34:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:11,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:16,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:34:17,239][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:34:18,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:18,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:18,442][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:19,887][__main__][INFO] - Iteration 283 took 52s (9.70% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 19m 24s. Estimated total time: 14h 32m 20s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 10s. [2026-03-25 18:34:19,890][__main__][INFO] - Starting iteration 283. [2026-03-25 18:34:19,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:34:19,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:34:24,734][__main__][INFO] - Number of regex retries in iteration 283: 0 [2026-03-25 18:34:24,735][__main__][INFO] - agents played in iteration 283 are Alice, Bob [2026-03-25 18:34:25,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:34:25,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:34:25,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:34:25,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:34:26,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:34:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:34:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:34:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:34:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:34:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:34:32,901][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:34:33,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:34:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:34:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:34:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:34:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:34:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:34:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:34:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:34:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:34:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:34:41,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:34:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:34:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:34:43,553][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:34:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:34:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:34:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:34:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:34:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:34:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:34:48,154][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:34:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:34:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:34:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:34:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:34:51,442][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:34:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:34:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:34:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:56,044][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:56,702][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:57,359][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:35:00,224][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:35:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:35:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:35:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:35:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:35:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:35:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:35:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:35:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:35:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:35:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:35:07,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:35:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:35:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:35:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:35:10,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:35:10,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:35:11,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:35:11,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:35:11,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:35:14,069][__main__][INFO] - Iteration 284 took 54s (8.93% Gen, 87.10% Train). Generation: 4s, Training: 47s. Estimated remaining time: 10h 49m 7s. Estimated total time: 15h 2m 57s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 17s, 500 more iterations: 7h 31m 28s. [2026-03-25 18:35:14,072][__main__][INFO] - Starting iteration 284. [2026-03-25 18:35:14,077][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:35:14,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:35:18,931][__main__][INFO] - Number of regex retries in iteration 284: 0 [2026-03-25 18:35:18,933][__main__][INFO] - agents played in iteration 284 are Alice, Bob [2026-03-25 18:35:19,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:35:19,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:35:19,507][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:35:19,508][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:35:20,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:35:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:35:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:35:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:35:22,776][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:35:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:35:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:35:24,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:35:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:35:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:35:26,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:35:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:35:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:35:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:35:29,359][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:35:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:35:30,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:35:31,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:35:31,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:35:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:35:33,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:35:33,963][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:35:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:35:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:35:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:35:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:35:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:35:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:35:38,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:35:39,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:35:39,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:35:40,541][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:35:41,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:35:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:35:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:35:43,171][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:35:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:35:44,486][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:35:45,143][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:35:45,801][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:35:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:35:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:35:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:35:48,433][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:35:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:35:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:35:50,408][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:35:51,066][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:35:52,030][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:35:52,689][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:35:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:35:54,006][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:35:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:35:55,322][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:35:55,980][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:35:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:35:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:35:57,954][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:35:58,612][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:35:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:35:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:36:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:36:01,244][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:36:01,902][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:36:02,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:36:03,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:36:04,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:36:04,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:36:04,470][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:36:05,956][__main__][INFO] - Iteration 285 took 51s (9.36% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 9m 59s. Estimated total time: 14h 24m 41s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 20s. [2026-03-25 18:36:05,958][__main__][INFO] - Starting iteration 285. [2026-03-25 18:36:05,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:36:05,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:36:11,037][__main__][INFO] - Number of regex retries in iteration 285: 0 [2026-03-25 18:36:11,038][__main__][INFO] - agents played in iteration 285 are Alice, Bob [2026-03-25 18:36:11,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:36:11,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:36:11,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:36:11,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:36:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:36:12,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:36:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:36:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:36:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:36:15,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:36:16,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:36:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:36:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:36:18,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:36:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:36:19,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:36:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:36:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:36:21,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:36:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:36:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:36:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:36:24,179][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:36:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:36:25,496][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:36:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:36:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:36:27,473][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:36:28,132][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:36:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:36:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:36:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:36:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:36:31,426][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:36:32,085][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:36:32,743][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:36:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:36:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:36:34,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:36:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:36:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:36:36,694][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:36:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:36:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:36:38,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:36:39,328][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:36:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:36:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:36:41,303][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:36:41,961][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:36:42,620][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:36:43,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:36:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:36:44,844][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:36:45,507][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:36:46,172][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:36:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:36:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:36:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:36:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:36:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:36:50,138][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:36:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:36:51,458][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:36:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:36:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:36:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:36:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:36:54,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:36:55,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:36:56,833][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:36:56,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:36:56,838][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:37:00,279][__main__][INFO] - Iteration 286 took 54s (9.34% Gen, 84.32% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 49m 41s. Estimated total time: 15h 5m 18s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 31s, 500 more iterations: 7h 32m 39s. [2026-03-25 18:37:00,282][__main__][INFO] - Starting iteration 286. [2026-03-25 18:37:00,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:37:00,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:37:09,217][__main__][INFO] - Number of regex retries in iteration 286: 0 [2026-03-25 18:37:09,219][__main__][INFO] - agents played in iteration 286 are Alice, Bob [2026-03-25 18:37:09,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:37:09,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:37:09,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:37:09,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:37:10,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:37:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:37:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:37:12,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:37:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:37:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:37:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:37:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:37:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:37:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:37:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:37:17,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:37:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:37:19,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:37:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:37:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:37:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:37:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:37:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:37:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:37:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:37:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:37:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:37:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:37:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:37:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:37:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:37:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:37:28,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:37:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:37:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:37:30,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:37:31,615][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:37:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:37:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:37:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:37:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:37:34,907][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:37:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:37:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:37:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:37:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:37:38,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:37:38,863][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:37:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:37:40,178][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:37:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:37:41,495][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:37:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:37:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:37:43,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:37:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:37:45,119][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:37:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:37:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:37:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:37:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:37:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:37:49,071][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:37:49,730][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:37:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:37:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:37:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:37:52,363][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:37:53,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:37:53,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:37:54,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:37:54,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:37:54,996][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:37:56,549][__main__][INFO] - Iteration 287 took 56s (15.87% Gen, 81.36% Train). Generation: 8s, Training: 45s. Estimated remaining time: 11h 21m 12s. Estimated total time: 15h 37m 45s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 52s. [2026-03-25 18:37:56,551][__main__][INFO] - Starting iteration 287. [2026-03-25 18:37:56,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:37:56,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:38:01,510][__main__][INFO] - Number of regex retries in iteration 287: 0 [2026-03-25 18:38:01,511][__main__][INFO] - agents played in iteration 287 are Alice, Bob [2026-03-25 18:38:02,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:02,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:02,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:38:02,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:38:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:38:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:38:04,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:38:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:38:05,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:38:06,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:38:06,785][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:38:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:38:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:38:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:38:09,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:38:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:38:10,754][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:38:11,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:38:12,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:38:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:38:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:38:14,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:38:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:38:15,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:38:16,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:38:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:38:17,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:38:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:38:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:38:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:38:20,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:38:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:38:21,346][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:38:22,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:38:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:38:23,327][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:38:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:38:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:38:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:38:25,974][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:38:26,634][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:38:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:38:27,955][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:38:28,617][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:38:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:38:29,938][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:38:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:38:31,261][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:38:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:38:32,580][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:38:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:38:33,902][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:38:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:38:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:38:36,221][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:38:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:38:37,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:38:38,205][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:38:38,869][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:38:39,530][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:38:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:38:40,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:38:41,515][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:38:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:38:42,838][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:38:43,498][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:38:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:38:44,819][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:38:45,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:38:46,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:38:47,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:38:47,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:38:47,514][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:38:48,930][__main__][INFO] - Iteration 288 took 52s (9.46% Gen, 87.83% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 15m 31s. Estimated total time: 14h 32m 56s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 28s. [2026-03-25 18:38:48,933][__main__][INFO] - Starting iteration 288. [2026-03-25 18:38:48,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:38:48,940][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:38:58,281][__main__][INFO] - Number of regex retries in iteration 288: 0 [2026-03-25 18:38:58,283][__main__][INFO] - agents played in iteration 288 are Alice, Bob [2026-03-25 18:38:58,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:58,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:58,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:38:58,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:38:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:39:00,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:39:00,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:39:01,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:39:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:39:02,901][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:39:03,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:39:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:39:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:39:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:39:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:39:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:39:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:39:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:39:08,840][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:39:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:39:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:39:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:39:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:39:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:39:12,800][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:39:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:39:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:39:14,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:39:15,440][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:39:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:39:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:39:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:39:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:39:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:39:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:39:20,058][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:39:20,717][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:39:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:39:22,035][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:39:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:39:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:39:24,013][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:39:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:39:25,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:39:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:39:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:39:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:39:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:39:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:39:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:39:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:39:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:39:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:39:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:39:32,822][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:39:33,480][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:39:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:39:34,797][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:39:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:39:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:39:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:39:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:39:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:39:38,751][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:39:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:39:40,068][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:39:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:39:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:39:42,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:39:42,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:39:44,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:39:44,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:39:44,087][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:39:45,453][__main__][INFO] - Iteration 289 took 56s (16.53% Gen, 81.04% Train). Generation: 9s, Training: 45s. Estimated remaining time: 11h 23m 36s. Estimated total time: 15h 41m 57s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 58s. [2026-03-25 18:39:45,456][__main__][INFO] - Starting iteration 289. [2026-03-25 18:39:45,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:39:45,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:39:50,426][__main__][INFO] - Number of regex retries in iteration 289: 0 [2026-03-25 18:39:50,428][__main__][INFO] - agents played in iteration 289 are Alice, Bob [2026-03-25 18:39:50,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:51,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:51,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:39:51,009][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:39:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:39:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:39:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:39:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:39:54,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:39:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:39:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:39:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:39:56,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:39:57,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:39:58,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:39:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:39:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:40:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:40:00,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:40:01,502][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:40:02,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:40:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:40:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:40:04,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:40:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:40:05,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:40:06,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:40:06,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:40:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:40:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:40:08,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:40:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:40:10,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:40:10,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:40:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:40:12,033][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:40:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:40:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:40:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:40:14,665][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:40:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:40:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:40:16,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:40:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:40:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:40:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:40:19,272][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:40:19,930][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:40:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:40:21,246][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:40:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:40:22,562][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:40:23,453][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:40:24,111][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:40:24,770][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:40:25,428][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:40:26,087][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:40:26,746][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:40:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:40:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:40:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:40:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:40:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:40:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:40:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:40:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:40:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:40:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:40:33,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:40:34,682][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:40:35,916][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:40:35,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:40:35,925][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:40:37,419][__main__][INFO] - Iteration 290 took 51s (9.56% Gen, 87.56% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 6m 47s. Estimated total time: 14h 26m 1s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 0s. [2026-03-25 18:40:37,422][__main__][INFO] - Starting iteration 290. [2026-03-25 18:40:37,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:40:37,427][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:40:42,354][__main__][INFO] - Number of regex retries in iteration 290: 0 [2026-03-25 18:40:42,356][__main__][INFO] - agents played in iteration 290 are Alice, Bob [2026-03-25 18:40:43,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:40:43,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:40:43,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:40:43,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:40:43,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:40:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:40:45,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:40:45,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:40:46,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:40:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:40:47,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:40:48,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:40:48,971][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:40:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:40:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:40:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:40:51,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:40:52,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:40:52,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:40:53,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:40:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:40:54,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:40:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:40:56,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:40:56,862][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:40:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:40:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:40:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:40:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:41:00,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:41:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:41:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:41:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:41:02,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:41:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:41:04,092][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:41:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:41:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:41:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:41:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:41:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:41:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:41:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:41:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:41:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:41:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:41:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:41:11,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:41:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:41:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:41:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:41:14,620][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:41:15,597][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:41:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:41:16,911][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:41:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:41:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:41:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:41:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:41:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:41:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:41:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:41:22,170][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:41:22,827][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:41:23,485][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:41:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:41:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:41:25,457][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:41:26,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:41:26,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:41:27,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:41:27,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:41:27,973][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:41:29,237][__main__][INFO] - Iteration 291 took 51s (9.51% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 3m 27s. Estimated total time: 14h 23m 33s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 46s. [2026-03-25 18:41:29,240][__main__][INFO] - Starting iteration 291. [2026-03-25 18:41:29,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:41:29,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:41:34,835][__main__][INFO] - Number of regex retries in iteration 291: 0 [2026-03-25 18:41:34,836][__main__][INFO] - agents played in iteration 291 are Alice, Bob [2026-03-25 18:41:35,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:41:35,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:41:35,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:41:35,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:41:36,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:41:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:41:37,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:41:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:41:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:41:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:41:39,947][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:41:40,604][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:41:41,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:41:41,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:41:42,577][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:41:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:41:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:41:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:41:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:41:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:41:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:41:47,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:41:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:41:48,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:41:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:41:49,810][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:41:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:41:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:41:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:41:52,440][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:41:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:41:53,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:41:54,414][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:41:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:41:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:41:56,390][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:41:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:41:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:41:58,362][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:41:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:41:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:42:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:42:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:42:01,649][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:42:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:42:02,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:42:03,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:42:04,278][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:42:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:42:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:42:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:42:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:42:07,796][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:42:08,454][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:42:09,111][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:42:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:42:10,427][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:42:11,084][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:42:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:42:12,399][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:42:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:42:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:42:14,371][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:42:15,028][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:42:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:42:16,344][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:42:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:42:17,658][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:42:18,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:42:19,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 18:42:20,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:42:20,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:42:20,188][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:42:21,616][__main__][INFO] - Iteration 292 took 52s (10.67% Gen, 86.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 11m 55s. Estimated total time: 14h 32m 53s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 26s. [2026-03-25 18:42:21,619][__main__][INFO] - Starting iteration 292. [2026-03-25 18:42:21,625][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:42:21,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:42:26,585][__main__][INFO] - Number of regex retries in iteration 292: 0 [2026-03-25 18:42:26,587][__main__][INFO] - agents played in iteration 292 are Alice, Bob [2026-03-25 18:42:27,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:42:27,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:42:27,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:42:27,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:42:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:42:28,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:42:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:42:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:42:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:42:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:42:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:42:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:42:33,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:42:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:42:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:42:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:42:36,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:42:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:42:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:42:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:42:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:42:39,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:42:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:42:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:42:41,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:42:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:42:42,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:42:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:42:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:42:44,549][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:42:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:42:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:42:46,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:42:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:42:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:42:48,493][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:42:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:42:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:42:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:42:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:42:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:42:52,437][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:42:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:42:53,754][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:42:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:42:55,067][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:42:55,726][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:42:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:42:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:42:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:42:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:42:59,014][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:42:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:43:00,588][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:43:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:43:01,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:43:02,562][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:43:03,221][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:43:03,878][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:43:04,537][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:43:05,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:43:05,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:43:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:43:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:43:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:43:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:43:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:43:09,801][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:43:10,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:43:11,152][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 18:43:12,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:43:12,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:43:12,348][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:43:15,620][__main__][INFO] - Iteration 293 took 53s (9.19% Gen, 84.75% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 38m 6s. Estimated total time: 14h 59m 57s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 59s, 500 more iterations: 7h 29m 58s. [2026-03-25 18:43:15,622][__main__][INFO] - Starting iteration 293. [2026-03-25 18:43:15,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:43:15,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:43:20,627][__main__][INFO] - Number of regex retries in iteration 293: 0 [2026-03-25 18:43:20,629][__main__][INFO] - agents played in iteration 293 are Alice, Bob [2026-03-25 18:43:21,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:43:21,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:43:21,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:43:21,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:43:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:43:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:43:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:43:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:43:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:43:25,253][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:43:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:43:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:43:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:43:27,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:43:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:43:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:43:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:43:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:43:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:43:31,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:43:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:43:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:43:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:43:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:43:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:43:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:43:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:43:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:43:37,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:43:38,430][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:43:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:43:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:43:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:43:41,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:43:41,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:43:42,386][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:43:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:43:43,706][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:43:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:43:45,025][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:43:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:43:46,344][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:43:47,004][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:43:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:43:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:43:48,985][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:43:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:43:50,304][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:43:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:43:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:43:52,282][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:43:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:43:53,890][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:43:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:43:55,208][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:43:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:43:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:43:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:43:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:43:58,506][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:43:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:43:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:44:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:44:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:44:01,800][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:44:02,460][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:44:03,118][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:44:03,776][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:44:04,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:44:05,167][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:44:06,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:44:06,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:44:06,414][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:44:08,847][__main__][INFO] - Iteration 294 took 53s (9.40% Gen, 86.03% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 24m 17s. Estimated total time: 14h 47m 2s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 42s, 500 more iterations: 7h 23m 31s. [2026-03-25 18:44:08,850][__main__][INFO] - Starting iteration 294. [2026-03-25 18:44:08,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:44:08,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:44:13,916][__main__][INFO] - Number of regex retries in iteration 294: 0 [2026-03-25 18:44:13,917][__main__][INFO] - agents played in iteration 294 are Alice, Bob [2026-03-25 18:44:14,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:44:14,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:44:14,653][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:44:14,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:44:15,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:44:15,966][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:44:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:44:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:44:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:44:18,607][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:44:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:44:19,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:44:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:44:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:44:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:44:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:44:23,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:44:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:44:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:44:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:44:25,873][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:44:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:44:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:44:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:44:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:44:29,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:44:29,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:44:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:44:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:44:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:44:32,481][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:44:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:44:33,799][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:44:34,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:44:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:44:35,779][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:44:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:44:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:44:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:44:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:44:39,091][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:44:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:44:40,418][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:44:41,077][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:44:41,737][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:44:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:44:43,057][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:44:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:44:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:44:45,040][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:44:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:44:46,361][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:44:47,359][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:44:48,018][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:44:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:44:49,340][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:44:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:44:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:44:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:44:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:44:52,645][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:44:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:44:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:44:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:44:55,287][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:44:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:44:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:44:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:44:57,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:44:58,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:45:00,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:45:00,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:45:00,058][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:45:01,355][__main__][INFO] - Iteration 295 took 52s (9.64% Gen, 87.88% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 11m 26s. Estimated total time: 14h 35m 3s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 31s. [2026-03-25 18:45:01,358][__main__][INFO] - Starting iteration 295. [2026-03-25 18:45:01,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:45:01,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:45:06,221][__main__][INFO] - Number of regex retries in iteration 295: 0 [2026-03-25 18:45:06,222][__main__][INFO] - agents played in iteration 295 are Alice, Bob [2026-03-25 18:45:06,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:07,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:07,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:45:07,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:45:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:45:08,339][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:45:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:45:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:45:10,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:45:10,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:45:11,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:45:12,299][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:45:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:45:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:45:14,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:45:14,943][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:45:15,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:45:16,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:45:16,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:45:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:45:18,245][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:45:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:45:19,566][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:45:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:45:20,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:45:21,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:45:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:45:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:45:23,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:45:24,184][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:45:24,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:45:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:45:26,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:45:26,821][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:45:27,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:45:28,142][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:45:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:45:29,461][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:45:30,120][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:45:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:45:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:45:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:45:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:45:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:45:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:45:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:45:35,397][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:45:36,057][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:45:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:45:37,376][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:45:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:45:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:45:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:45:40,376][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:45:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:45:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:45:42,354][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:45:43,013][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:45:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:45:44,332][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:45:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:45:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:45:46,311][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:45:46,970][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:45:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:45:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:45:48,949][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:45:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:45:50,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:45:50,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:45:52,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:45:52,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:45:52,336][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:45:53,756][__main__][INFO] - Iteration 296 took 52s (9.27% Gen, 88.01% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 8m 45s. Estimated total time: 14h 33m 15s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 37s. [2026-03-25 18:45:53,760][__main__][INFO] - Starting iteration 296. [2026-03-25 18:45:53,765][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:45:53,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:45:58,810][__main__][INFO] - Number of regex retries in iteration 296: 0 [2026-03-25 18:45:58,811][__main__][INFO] - agents played in iteration 296 are Alice, Bob [2026-03-25 18:45:59,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:59,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:59,608][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:45:59,609][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:46:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:46:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:46:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:46:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:46:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:46:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:46:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:46:04,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:46:05,488][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:46:06,147][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:46:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:46:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:46:08,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:46:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:46:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:46:10,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:46:10,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:46:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:46:12,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:46:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:46:13,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:46:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:46:14,718][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:46:15,376][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:46:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:46:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:46:17,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:46:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:46:18,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:46:19,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:46:19,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:46:20,648][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:46:21,307][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:46:21,965][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:46:22,623][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:46:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:46:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:46:24,599][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:46:25,258][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:46:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:46:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:46:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:46:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:46:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:46:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:46:29,872][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:46:30,531][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:46:31,189][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:46:32,109][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:46:32,770][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:46:33,430][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:46:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:46:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:46:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:46:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:46:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:46:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:46:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:46:38,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:46:39,362][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:46:40,021][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:46:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:46:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:46:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:46:42,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:46:43,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:46:46,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:46:46,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:46:46,360][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:46:47,957][__main__][INFO] - Iteration 297 took 54s (9.31% Gen, 87.74% Train). Generation: 5s, Training: 47s. Estimated remaining time: 10h 37m 50s. Estimated total time: 15h 3m 14s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 19s, 500 more iterations: 7h 31m 37s. [2026-03-25 18:46:47,960][__main__][INFO] - Starting iteration 297. [2026-03-25 18:46:47,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:46:47,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:46:52,837][__main__][INFO] - Number of regex retries in iteration 297: 0 [2026-03-25 18:46:52,839][__main__][INFO] - agents played in iteration 297 are Alice, Bob [2026-03-25 18:46:53,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:46:53,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:46:53,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:46:53,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:46:54,177][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:46:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:46:55,455][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:46:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:46:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:46:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:46:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:46:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:46:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:47:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:47:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:47:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:47:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:47:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:47:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:47:04,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:47:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:47:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:47:05,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:47:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:47:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:47:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:47:08,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:47:09,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:47:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:47:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:47:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:47:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:47:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:47:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:47:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:47:14,572][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:47:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:47:15,891][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:47:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:47:17,209][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:47:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:47:18,530][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:47:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:47:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:47:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:47:21,167][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:47:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:47:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:47:23,144][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:47:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:47:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:47:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:47:26,122][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:47:26,781][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:47:27,440][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:47:28,098][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:47:28,756][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:47:29,414][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:47:30,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:47:30,731][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:47:31,389][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:47:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:47:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:47:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:47:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:47:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:47:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:47:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:47:36,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:47:37,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:47:38,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:47:38,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:47:38,623][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:47:39,990][__main__][INFO] - Iteration 298 took 52s (9.37% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 0m 52s. Estimated total time: 14h 27m 8s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 34s. [2026-03-25 18:47:39,993][__main__][INFO] - Starting iteration 298. [2026-03-25 18:47:39,997][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:47:39,997][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:47:45,284][__main__][INFO] - Number of regex retries in iteration 298: 0 [2026-03-25 18:47:45,286][__main__][INFO] - agents played in iteration 298 are Alice, Bob [2026-03-25 18:47:45,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:47:45,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:47:45,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:47:45,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:47:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:47:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:47:47,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:47:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:47:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:47:49,857][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:47:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:47:51,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:47:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:47:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:47:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:47:53,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:47:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:47:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:47:55,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:47:56,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:47:57,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:47:57,745][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:47:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:47:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:47:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:48:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:48:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:48:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:48:02,345][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:48:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:48:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:48:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:48:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:48:05,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:48:06,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:48:06,945][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:48:07,603][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:48:08,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:48:08,917][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:48:09,575][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:48:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:48:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:48:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:48:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:48:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:48:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:48:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:48:14,842][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:48:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:48:16,158][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:48:16,816][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:48:17,474][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:48:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:48:19,084][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:48:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:48:20,405][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:48:21,063][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:48:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:48:22,380][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:48:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:48:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:48:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:48:25,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:48:25,671][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:48:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:48:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:48:27,649][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:48:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:48:28,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:48:29,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:48:31,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:48:31,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:48:31,022][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:48:32,337][__main__][INFO] - Iteration 299 took 52s (10.10% Gen, 87.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 5m 13s. Estimated total time: 14h 32m 22s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 11s. [2026-03-25 18:48:32,341][__main__][INFO] - Starting iteration 299. [2026-03-25 18:48:32,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:48:32,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:48:37,427][__main__][INFO] - Number of regex retries in iteration 299: 0 [2026-03-25 18:48:37,428][__main__][INFO] - agents played in iteration 299 are Alice, Bob [2026-03-25 18:48:37,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:38,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:38,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:48:38,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:48:38,694][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:48:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:48:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:48:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:48:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:48:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:48:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:48:43,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:48:43,929][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:48:44,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:48:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:48:45,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:48:46,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:48:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:48:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:48:48,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:48:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:48:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:48:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:48:51,167][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:48:51,825][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:48:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:48:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:48:53,798][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:48:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:48:55,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:48:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:48:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:48:57,092][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:48:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:48:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:48:59,071][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:48:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:49:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:49:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:49:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:49:02,363][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:49:03,022][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:49:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:49:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:49:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:49:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:49:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:49:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:49:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:49:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:49:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:49:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:49:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:49:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:49:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:49:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:49:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:49:13,882][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:49:14,540][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:49:15,202][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:49:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:49:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:49:17,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:49:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:49:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:49:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:49:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:49:20,468][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:49:21,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:49:21,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:49:23,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:49:23,004][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:49:23,005][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:49:25,461][__main__][INFO] - Iteration 300 took 53s (9.57% Gen, 85.81% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 17m 15s. Estimated total time: 14h 45m 17s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 38s. [2026-03-25 18:49:25,463][__main__][INFO] - Starting iteration 300. [2026-03-25 18:49:25,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:49:25,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:49:30,635][__main__][INFO] - Number of regex retries in iteration 300: 0 [2026-03-25 18:49:30,637][__main__][INFO] - agents played in iteration 300 are Alice, Bob [2026-03-25 18:49:31,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:49:31,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:49:31,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:49:31,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:49:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:49:32,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:49:33,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:49:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:49:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:49:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:49:35,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:49:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:49:37,207][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:49:37,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:49:38,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:49:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:49:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:49:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:49:41,164][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:49:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:49:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:49:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:49:43,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:49:44,461][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:49:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:49:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:49:46,441][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:49:47,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:49:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:49:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:49:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:49:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:49:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:49:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:49:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:49:52,371][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:49:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:49:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:49:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:49:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:49:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:49:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:49:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:49:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:49:58,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:49:58,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:49:59,623][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:50:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:50:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:50:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:50:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:50:02,917][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:50:03,841][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:50:04,500][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:50:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:50:05,816][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:50:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:50:07,131][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:50:07,790][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:50:08,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:50:09,107][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:50:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:50:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:50:11,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:50:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:50:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:50:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:50:13,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:50:14,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:50:15,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:50:16,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:50:16,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:50:16,261][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:50:22,934][__main__][INFO] - Iteration 301 took 57s (9.00% Gen, 79.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 28m 49s. Estimated total time: 15h 57m 48s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 46s, 500 more iterations: 7h 58m 54s. [2026-03-25 18:50:22,937][__main__][INFO] - Starting iteration 301. [2026-03-25 18:50:22,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:50:22,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:50:27,862][__main__][INFO] - Number of regex retries in iteration 301: 0 [2026-03-25 18:50:27,863][__main__][INFO] - agents played in iteration 301 are Alice, Bob [2026-03-25 18:50:28,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:50:28,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:50:28,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:50:28,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:50:29,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:50:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:50:30,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:50:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:50:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:50:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:50:33,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:50:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:50:34,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:50:35,048][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:50:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:50:36,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:50:37,022][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:50:37,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:50:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:50:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:50:39,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:50:40,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:50:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:50:41,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:50:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:50:42,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:50:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:50:44,259][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:50:44,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:50:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:50:46,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:50:46,893][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:50:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:50:48,209][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:50:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:50:49,524][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:50:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:50:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:50:51,497][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:50:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:50:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:50:53,470][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:50:54,127][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:50:54,784][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:50:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:50:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:50:56,757][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:50:57,415][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:50:58,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:50:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:50:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:51:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:51:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:51:01,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:51:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:51:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:51:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:51:04,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:51:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:51:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:51:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:51:06,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:51:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:51:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:51:08,864][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:51:09,523][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:51:10,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:51:10,840][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:51:11,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:51:12,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:51:13,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:51:13,416][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:51:13,417][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:51:14,644][__main__][INFO] - Iteration 302 took 51s (9.52% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 51m 54s. Estimated total time: 14h 21m 45s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 52s. [2026-03-25 18:51:14,647][__main__][INFO] - Starting iteration 302. [2026-03-25 18:51:14,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:51:14,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:51:19,575][__main__][INFO] - Number of regex retries in iteration 302: 0 [2026-03-25 18:51:19,576][__main__][INFO] - agents played in iteration 302 are Alice, Bob [2026-03-25 18:51:20,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:51:20,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:51:20,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:51:20,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:51:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:51:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:51:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:51:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:51:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:51:24,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:51:24,838][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:51:25,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:51:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:51:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:51:27,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:51:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:51:28,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:51:29,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:51:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:51:30,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:51:31,423][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:51:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:51:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:51:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:51:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:51:34,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:51:35,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:51:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:51:36,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:51:37,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:51:38,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:51:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:51:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:51:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:51:40,642][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:51:41,301][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:51:41,959][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:51:42,617][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:51:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:51:43,934][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:51:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:51:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:51:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:51:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:51:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:51:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:51:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:51:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:51:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:51:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:51:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:51:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:51:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:51:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:51:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:51:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:51:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:51:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:51:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:51:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:51:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:51:58,726][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:51:59,384][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:52:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:52:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:52:01,358][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:52:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:52:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:52:03,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:52:04,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:52:05,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:52:05,361][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:52:05,362][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:52:09,806][__main__][INFO] - Iteration 303 took 55s (8.93% Gen, 83.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 48m 32s. Estimated total time: 15h 19m 17s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 38s. [2026-03-25 18:52:09,810][__main__][INFO] - Starting iteration 303. [2026-03-25 18:52:09,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:52:09,814][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:52:14,946][__main__][INFO] - Number of regex retries in iteration 303: 0 [2026-03-25 18:52:14,947][__main__][INFO] - agents played in iteration 303 are Alice, Bob [2026-03-25 18:52:15,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:52:15,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:52:15,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:52:15,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:52:16,296][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:52:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:52:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:52:18,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:52:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:52:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:52:20,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:52:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:52:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:52:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:52:22,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:52:23,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:52:24,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:52:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:52:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:52:26,119][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:52:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:52:27,437][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:52:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:52:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:52:29,411][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:52:30,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:52:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:52:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:52:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:52:32,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:52:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:52:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:52:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:52:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:52:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:52:36,654][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:52:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:52:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:52:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:52:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:52:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:52:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:52:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:52:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:52:42,582][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:52:43,240][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:52:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:52:44,557][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:52:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:52:45,875][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:52:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:52:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:52:48,084][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:52:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:52:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:52:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:52:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:52:51,377][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:52:52,036][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:52:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:52:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:52:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:52:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:52:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:52:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:52:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:52:57,307][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:52:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:52:58,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:52:59,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:53:00,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:53:00,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:53:00,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:53:02,376][__main__][INFO] - Iteration 304 took 52s (9.76% Gen, 86.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 4m 26s. Estimated total time: 14h 36m 4s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 2s. [2026-03-25 18:53:02,379][__main__][INFO] - Starting iteration 304. [2026-03-25 18:53:02,383][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:53:02,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:53:13,492][__main__][INFO] - Number of regex retries in iteration 304: 0 [2026-03-25 18:53:13,494][__main__][INFO] - agents played in iteration 304 are Alice, Bob [2026-03-25 18:53:14,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:53:14,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:53:14,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:53:14,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:53:14,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:53:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:53:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:53:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:53:17,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:53:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:53:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:53:19,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:53:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:53:20,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:53:21,408][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:53:22,066][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:53:22,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:53:23,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:53:24,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:53:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:53:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:53:26,018][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:53:26,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:53:27,335][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:53:27,994][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:53:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:53:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:53:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:53:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:53:31,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:53:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:53:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:53:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:53:33,919][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:53:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:53:35,235][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:53:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:53:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:53:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:53:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:53:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:53:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:53:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:53:40,504][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:53:41,162][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:53:41,820][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:53:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:53:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:53:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:53:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:53:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:53:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:53:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:53:47,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:53:48,027][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:53:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:53:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:53:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:53:50,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:53:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:53:51,977][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:53:52,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:53:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:53:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:53:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:53:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:53:55,929][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:53:56,590][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:53:57,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:53:58,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:53:59,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:53:59,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:53:59,223][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:54:00,531][__main__][INFO] - Iteration 305 took 58s (19.11% Gen, 78.64% Train). Generation: 11s, Training: 45s. Estimated remaining time: 11h 36m 33s. Estimated total time: 16h 9m 10s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 55s, 500 more iterations: 8h 4m 35s. [2026-03-25 18:54:00,533][__main__][INFO] - Starting iteration 305. [2026-03-25 18:54:00,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:54:00,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:54:06,262][__main__][INFO] - Number of regex retries in iteration 305: 0 [2026-03-25 18:54:06,264][__main__][INFO] - agents played in iteration 305 are Alice, Bob [2026-03-25 18:54:07,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:07,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:07,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:54:07,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:54:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:54:08,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:54:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:54:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:54:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:54:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:54:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:54:12,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:54:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:54:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:54:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:54:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:54:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:54:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:54:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:54:17,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:54:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:54:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:54:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:54:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:54:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:54:21,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:54:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:54:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:54:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:54:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:54:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:54:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:54:26,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:54:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:54:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:54:28,200][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:54:28,858][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:54:29,516][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:54:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:54:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:54:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:54:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:54:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:54:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:54:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:54:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:54:35,441][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:54:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:54:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:54:37,420][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:54:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:54:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:54:39,731][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:54:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:54:41,050][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:54:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:54:42,366][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:54:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:54:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:54:44,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:54:44,997][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:54:45,655][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:54:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:54:46,970][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:54:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:54:48,285][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:54:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:54:49,600][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:54:50,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:54:51,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:54:52,177][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:54:52,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:54:52,181][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:54:53,675][__main__][INFO] - Iteration 306 took 53s (10.78% Gen, 86.41% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 12m 9s. Estimated total time: 14h 45m 39s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 33s, 500 more iterations: 7h 22m 49s. [2026-03-25 18:54:53,677][__main__][INFO] - Starting iteration 306. [2026-03-25 18:54:53,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:54:53,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:54:58,635][__main__][INFO] - Number of regex retries in iteration 306: 0 [2026-03-25 18:54:58,636][__main__][INFO] - agents played in iteration 306 are Alice, Bob [2026-03-25 18:54:59,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:59,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:59,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:54:59,284][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:54:59,995][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:55:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:55:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:55:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:55:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:55:03,236][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:55:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:55:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:55:05,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:55:05,866][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:55:06,523][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:55:07,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:55:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:55:08,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:55:09,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:55:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:55:10,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:55:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:55:11,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:55:12,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:55:13,099][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:55:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:55:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:55:15,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:55:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:55:16,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:55:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:55:17,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:55:18,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:55:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:55:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:55:20,340][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:55:20,998][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:55:21,656][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:55:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:55:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:55:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:55:24,287][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:55:24,945][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:55:25,604][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:55:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:55:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:55:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:55:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:55:28,894][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:55:29,551][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:55:30,208][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:55:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:55:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:55:32,516][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:55:33,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:55:33,831][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:55:34,489][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:55:35,147][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:55:35,806][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:55:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:55:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:55:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:55:38,437][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:55:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:55:39,753][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:55:40,411][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:55:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:55:41,727][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:55:42,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:55:43,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:55:44,360][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:55:44,363][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:55:44,365][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:55:46,181][__main__][INFO] - Iteration 307 took 52s (9.43% Gen, 87.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 0m 39s. Estimated total time: 14h 35m 2s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 31s. [2026-03-25 18:55:46,184][__main__][INFO] - Starting iteration 307. [2026-03-25 18:55:46,188][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:55:46,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:55:58,620][__main__][INFO] - Number of regex retries in iteration 307: 0 [2026-03-25 18:55:58,621][__main__][INFO] - agents played in iteration 307 are Alice, Bob [2026-03-25 18:55:59,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:55:59,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:55:59,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:55:59,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:55:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:56:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:56:01,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:56:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:56:02,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:56:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:56:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:56:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:56:05,188][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:56:05,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:56:06,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:56:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:56:07,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:56:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:56:09,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:56:09,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:56:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:56:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:56:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:56:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:56:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:56:13,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:56:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:56:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:56:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:56:16,378][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:56:17,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:56:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:56:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:56:19,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:56:19,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:56:20,333][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:56:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:56:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:56:22,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:56:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:56:23,622][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:56:24,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:56:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:56:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:56:26,255][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:56:26,913][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:56:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:56:28,229][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:56:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:56:29,545][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:56:30,203][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:56:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:56:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:56:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:56:33,149][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:56:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:56:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:56:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:56:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:56:36,438][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:56:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:56:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:56:38,412][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:56:39,071][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:56:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:56:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:56:41,045][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:56:41,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:56:42,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:56:43,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:56:44,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:56:44,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:56:44,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:56:45,569][__main__][INFO] - Iteration 308 took 59s (20.94% Gen, 76.95% Train). Generation: 12s, Training: 45s. Estimated remaining time: 11h 54m 21s. Estimated total time: 16h 29m 43s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 58s, 500 more iterations: 8h 14m 51s. [2026-03-25 18:56:45,572][__main__][INFO] - Starting iteration 308. [2026-03-25 18:56:45,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:56:45,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:56:50,480][__main__][INFO] - Number of regex retries in iteration 308: 0 [2026-03-25 18:56:50,481][__main__][INFO] - agents played in iteration 308 are Alice, Bob [2026-03-25 18:56:51,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:56:51,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:56:51,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:56:51,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:56:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:56:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:56:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:56:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:56:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:56:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:56:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:56:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:56:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:56:57,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:56:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:56:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:56:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:57:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:57:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:57:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:57:02,323][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:57:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:57:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:57:04,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:57:04,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:57:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:57:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:57:06,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:57:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:57:08,248][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:57:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:57:09,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:57:10,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:57:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:57:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:57:12,196][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:57:12,854][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:57:13,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:57:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:57:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:57:15,487][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:57:16,145][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:57:16,802][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:57:17,461][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:57:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:57:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:57:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:57:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:57:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:57:21,409][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:57:22,067][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:57:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:57:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:57:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:57:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:57:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:57:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:57:26,963][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:57:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:57:28,279][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:57:28,937][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:57:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:57:30,256][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:57:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:57:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:57:32,229][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:57:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:57:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:57:34,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:57:34,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:57:36,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:57:36,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:57:36,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:57:37,327][__main__][INFO] - Iteration 309 took 51s (9.48% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 46m 20s. Estimated total time: 14h 22m 33s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 15s, 500 more iterations: 7h 11m 16s. [2026-03-25 18:57:37,329][__main__][INFO] - Starting iteration 309. [2026-03-25 18:57:37,334][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:57:37,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:57:42,387][__main__][INFO] - Number of regex retries in iteration 309: 0 [2026-03-25 18:57:42,388][__main__][INFO] - agents played in iteration 309 are Alice, Bob [2026-03-25 18:57:43,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:57:43,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:57:43,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:57:43,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:57:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:57:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:57:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:57:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:57:46,400][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:57:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:57:47,716][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:57:48,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:57:49,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:57:49,690][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:57:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:57:51,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:57:51,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:57:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:57:52,982][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:57:53,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:57:54,298][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:57:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:57:55,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:57:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:57:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:57:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:57:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:57:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:57:59,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:58:00,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:58:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:58:01,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:58:02,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:58:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:58:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:58:04,174][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:58:04,832][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:58:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:58:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:58:06,806][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:58:07,463][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:58:08,122][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:58:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:58:09,439][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:58:10,097][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:58:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:58:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:58:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:58:12,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:58:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:58:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:58:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:58:15,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:58:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:58:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:58:17,575][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:58:18,234][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:58:18,892][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:58:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:58:20,207][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:58:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:58:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:58:22,180][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:58:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:58:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:58:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:58:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:58:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:58:26,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:58:26,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:58:27,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:58:27,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:58:27,941][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:58:29,246][__main__][INFO] - Iteration 310 took 51s (9.73% Gen, 87.75% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 48m 9s. Estimated total time: 14h 25m 14s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 37s. [2026-03-25 18:58:29,249][__main__][INFO] - Starting iteration 310. [2026-03-25 18:58:29,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:58:29,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:58:34,126][__main__][INFO] - Number of regex retries in iteration 310: 0 [2026-03-25 18:58:34,127][__main__][INFO] - agents played in iteration 310 are Alice, Bob [2026-03-25 18:58:34,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:58:34,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:58:34,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:58:34,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:58:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:58:36,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:58:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:58:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:58:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:58:38,758][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:58:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:58:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:58:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:58:41,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:58:42,048][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:58:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:58:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:58:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:58:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:58:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:58:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:58:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:58:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:58:47,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:58:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:58:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:58:49,942][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:58:50,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:58:51,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:58:51,915][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:58:52,572][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:58:53,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:58:53,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:58:54,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:58:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:58:55,862][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:58:56,521][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:58:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:58:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:58:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:58:59,151][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:58:59,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:59:00,466][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:59:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:59:01,780][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:59:02,437][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:59:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:59:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:59:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:59:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:59:05,724][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:59:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:59:07,354][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:59:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:59:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:59:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:59:09,987][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:59:10,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:59:11,302][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:59:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:59:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:59:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:59:13,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:59:14,590][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:59:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:59:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:59:16,562][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:59:17,220][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:59:17,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:59:18,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:59:19,774][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:59:19,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:59:19,778][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:59:21,103][__main__][INFO] - Iteration 311 took 51s (9.40% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 46m 15s. Estimated total time: 14h 24m 12s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 6s. [2026-03-25 18:59:21,105][__main__][INFO] - Starting iteration 311. [2026-03-25 18:59:21,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:59:21,109][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:59:25,996][__main__][INFO] - Number of regex retries in iteration 311: 0 [2026-03-25 18:59:25,997][__main__][INFO] - agents played in iteration 311 are Alice, Bob [2026-03-25 18:59:26,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:59:26,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:59:26,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:59:26,538][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:59:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:59:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:59:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:59:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:59:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:59:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:59:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:59:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:59:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:59:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:59:33,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:59:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:59:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:59:35,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:59:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:59:37,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:59:37,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:59:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:59:39,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:59:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:59:40,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:59:41,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:59:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:59:42,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:59:43,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:59:43,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:59:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:59:44,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:59:45,645][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:59:46,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:59:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:59:47,619][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:59:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:59:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:59:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:59:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:59:50,910][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:59:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:59:52,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:59:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:59:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:59:54,201][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:59:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:59:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:59:56,177][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:59:56,835][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:59:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:59:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:59:59,047][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:59:59,705][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:00:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:00:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:00:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:00:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:00:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:00:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:00:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:00:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:00:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:00:06,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:00:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:00:07,608][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:00:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:00:08,923][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:00:09,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:00:10,299][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:00:11,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:00:11,460][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:00:11,461][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:00:12,993][__main__][INFO] - Iteration 312 took 51s (9.42% Gen, 87.62% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 45m 57s. Estimated total time: 14h 24m 46s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 23s. [2026-03-25 19:00:12,996][__main__][INFO] - Starting iteration 312. [2026-03-25 19:00:13,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:00:13,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:00:17,915][__main__][INFO] - Number of regex retries in iteration 312: 0 [2026-03-25 19:00:17,916][__main__][INFO] - agents played in iteration 312 are Alice, Bob [2026-03-25 19:00:18,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:00:18,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:00:18,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:00:18,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:00:19,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:00:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:00:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:00:21,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:00:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:00:22,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:00:23,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:00:23,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:00:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:00:25,131][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:00:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:00:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:00:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:00:27,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:00:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:00:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:00:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:00:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:00:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:00:31,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:00:32,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:00:33,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:00:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:00:34,347][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:00:35,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:00:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:00:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:00:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:00:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:00:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:00:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:00:39,609][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:00:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:00:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:00:41,582][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:00:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:00:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:00:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:00:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:00:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:00:45,527][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:00:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:00:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:00:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:00:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:00:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:00:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:00:50,129][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:00:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:00:51,714][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:00:52,371][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:00:53,029][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:00:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:00:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:00:55,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:00:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:00:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:00:56,977][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:00:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:00:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:00:58,950][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:00:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:01:00,265][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:01:00,923][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:01:01,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:01:02,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:01:03,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:01:03,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:01:03,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:01:05,884][__main__][INFO] - Iteration 313 took 52s (9.29% Gen, 86.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 10h 1m 44s. Estimated total time: 14h 41m 26s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 43s. [2026-03-25 19:01:05,887][__main__][INFO] - Starting iteration 313. [2026-03-25 19:01:05,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:01:05,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:01:11,678][__main__][INFO] - Number of regex retries in iteration 313: 0 [2026-03-25 19:01:11,679][__main__][INFO] - agents played in iteration 313 are Alice, Bob [2026-03-25 19:01:12,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:12,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:12,227][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:01:12,228][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:01:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:01:13,499][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:01:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:01:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:01:15,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:01:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:01:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:01:17,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:01:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:01:18,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:01:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:01:20,078][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:01:20,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:01:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:01:22,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:01:22,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:01:23,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:01:24,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:01:24,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:01:25,339][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:01:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:01:26,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:01:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:01:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:01:28,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:01:29,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:01:29,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:01:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:01:31,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:01:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:01:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:01:33,242][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:01:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:01:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:01:35,216][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:01:35,873][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:01:36,531][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:01:37,189][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:01:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:01:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:01:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:01:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:01:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:01:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:01:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:01:42,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:01:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:01:43,768][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:01:44,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:01:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:01:45,976][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:01:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:01:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:01:47,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:01:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:01:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:01:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:01:50,584][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:01:51,243][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:01:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:01:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:01:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:01:53,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:01:54,535][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:01:55,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:01:55,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:01:57,030][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:01:57,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:01:57,035][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:01:58,545][__main__][INFO] - Iteration 314 took 52s (10.99% Gen, 86.13% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 57m 2s. Estimated total time: 14h 37m 36s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 48s. [2026-03-25 19:01:58,550][__main__][INFO] - Starting iteration 314. [2026-03-25 19:01:58,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:01:58,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:02:03,733][__main__][INFO] - Number of regex retries in iteration 314: 0 [2026-03-25 19:02:03,734][__main__][INFO] - agents played in iteration 314 are Alice, Bob [2026-03-25 19:02:04,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:04,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:04,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:02:04,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:02:05,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:02:05,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:02:06,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:02:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:02:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:02:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:02:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:02:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:02:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:02:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:02:11,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:02:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:02:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:02:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:02:14,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:02:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:02:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:02:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:02:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:02:17,531][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:02:18,190][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:02:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:02:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:02:20,166][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:02:20,825][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:02:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:02:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:02:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:02:23,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:02:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:02:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:02:25,433][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:02:26,092][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:02:26,751][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:02:27,409][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:02:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:02:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:02:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:02:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:02:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:02:31,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:02:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:02:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:02:33,334][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:02:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:02:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:02:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:02:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:02:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:02:37,551][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:02:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:02:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:02:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:02:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:02:40,845][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:02:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:02:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:02:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:02:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:02:44,135][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:02:44,794][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:02:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:02:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:02:46,770][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:02:47,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:02:48,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:02:49,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:02:49,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:02:49,341][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:02:50,695][__main__][INFO] - Iteration 315 took 52s (9.93% Gen, 87.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 47m 34s. Estimated total time: 14h 29m 1s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 30s. [2026-03-25 19:02:50,697][__main__][INFO] - Starting iteration 315. [2026-03-25 19:02:50,702][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:02:50,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:02:55,701][__main__][INFO] - Number of regex retries in iteration 315: 0 [2026-03-25 19:02:55,703][__main__][INFO] - agents played in iteration 315 are Alice, Bob [2026-03-25 19:02:56,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:56,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:56,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:02:56,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:02:57,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:02:57,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:02:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:02:59,091][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:02:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:03:00,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:03:01,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:03:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:03:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:03:03,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:03:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:03:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:03:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:03:05,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:03:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:03:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:03:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:03:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:03:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:03:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:03:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:03:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:03:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:03:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:03:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:03:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:03:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:03:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:03:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:03:16,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:03:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:03:17,523][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:03:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:03:18,839][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:03:19,497][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:03:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:03:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:03:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:03:22,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:03:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:03:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:03:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:03:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:03:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:03:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:03:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:03:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:03:28,056][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:03:29,051][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:03:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:03:30,367][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:03:31,026][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:03:31,684][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:03:32,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:03:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:03:33,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:03:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:03:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:03:35,636][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:03:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:03:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:03:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:03:38,270][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:03:38,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:03:39,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:03:40,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:03:41,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:03:41,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:03:41,554][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:03:43,258][__main__][INFO] - Iteration 316 took 52s (9.51% Gen, 87.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 53m 39s. Estimated total time: 14h 35m 58s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 59s. [2026-03-25 19:03:43,261][__main__][INFO] - Starting iteration 316. [2026-03-25 19:03:43,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:03:43,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:03:48,928][__main__][INFO] - Number of regex retries in iteration 316: 0 [2026-03-25 19:03:48,930][__main__][INFO] - agents played in iteration 316 are Alice, Bob [2026-03-25 19:03:49,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:03:49,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:03:49,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:03:49,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:03:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:03:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:03:51,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:03:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:03:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:03:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:03:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:03:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:03:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:03:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:03:56,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:03:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:03:58,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:03:58,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:03:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:04:00,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:04:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:04:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:04:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:04:02,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:04:03,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:04:04,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:04:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:04:05,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:04:06,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:04:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:04:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:04:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:04:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:04:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:04:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:04:10,623][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:04:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:04:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:04:12,598][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:04:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:04:13,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:04:14,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:04:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:04:15,887][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:04:16,545][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:04:17,203][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:04:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:04:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:04:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:04:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:04:20,491][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:04:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:04:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:04:22,800][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:04:23,458][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:04:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:04:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:04:25,433][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:04:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:04:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:04:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:04:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:04:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:04:29,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:04:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:04:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:04:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:04:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:04:32,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:04:33,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:04:34,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:04:36,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:04:36,056][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:04:41,246][__main__][INFO] - Iteration 317 took 57s (9.77% Gen, 81.28% Train). Generation: 5s, Training: 47s. Estimated remaining time: 11h 23m 5s. Estimated total time: 16h 6m 23s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 38s, 500 more iterations: 8h 3m 11s. [2026-03-25 19:04:41,248][__main__][INFO] - Starting iteration 317. [2026-03-25 19:04:41,252][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:04:41,253][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:04:46,144][__main__][INFO] - Number of regex retries in iteration 317: 0 [2026-03-25 19:04:46,145][__main__][INFO] - agents played in iteration 317 are Alice, Bob [2026-03-25 19:04:46,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:04:46,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:04:46,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:04:46,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:04:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:04:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:04:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:04:49,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:04:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:04:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:04:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:04:52,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:04:52,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:04:53,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:04:54,054][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:04:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:04:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:04:56,034][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:04:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:04:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:04:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:04:58,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:04:59,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:04:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:05:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:05:01,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:05:01,963][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:05:02,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:05:03,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:05:03,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:05:04,595][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:05:05,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:05:05,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:05:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:05:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:05:07,883][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:05:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:05:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:05:09,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:05:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:05:11,172][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:05:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:05:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:05:13,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:05:13,803][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:05:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:05:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:05:15,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:05:16,433][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:05:17,091][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:05:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:05:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:05:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:05:20,045][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:05:20,706][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:05:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:05:22,020][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:05:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:05:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:05:23,993][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:05:24,651][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:05:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:05:25,967][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:05:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:05:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:05:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:05:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:05:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:05:29,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:05:30,610][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:05:31,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:05:31,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:05:31,738][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:05:33,294][__main__][INFO] - Iteration 318 took 52s (9.40% Gen, 87.60% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 43m 14s. Estimated total time: 14h 27m 23s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 41s. [2026-03-25 19:05:33,297][__main__][INFO] - Starting iteration 318. [2026-03-25 19:05:33,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:05:33,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:05:38,972][__main__][INFO] - Number of regex retries in iteration 318: 0 [2026-03-25 19:05:38,974][__main__][INFO] - agents played in iteration 318 are Alice, Bob [2026-03-25 19:05:39,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:05:39,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:05:39,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:05:39,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:05:40,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:05:41,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:05:41,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:05:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:05:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:05:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:05:44,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:05:45,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:05:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:05:46,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:05:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:05:47,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:05:48,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:05:48,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:05:49,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:05:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:05:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:05:51,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:05:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:05:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:05:53,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:05:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:05:54,885][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:05:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:05:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:05:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:05:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:05:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:05:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:05:59,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:06:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:06:00,805][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:06:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:06:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:06:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:06:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:06:04,095][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:06:04,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:06:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:06:06,071][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:06:06,729][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:06:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:06:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:06:08,703][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:06:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:06:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:06:10,676][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:06:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:06:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:06:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:06:13,619][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:06:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:06:14,933][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:06:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:06:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:06:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:06:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:06:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:06:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:06:19,541][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:06:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:06:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:06:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:06:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:06:22,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:06:23,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:06:24,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:06:24,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:06:24,913][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:06:26,349][__main__][INFO] - Iteration 319 took 53s (10.69% Gen, 86.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 59m 6s. Estimated total time: 14h 44m 9s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 4s. [2026-03-25 19:06:26,353][__main__][INFO] - Starting iteration 319. [2026-03-25 19:06:26,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:06:26,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:06:31,406][__main__][INFO] - Number of regex retries in iteration 319: 0 [2026-03-25 19:06:31,407][__main__][INFO] - agents played in iteration 319 are Alice, Bob [2026-03-25 19:06:31,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:06:32,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:06:32,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:06:32,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:06:32,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:06:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:06:33,988][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:06:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:06:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:06:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:06:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:06:37,281][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:06:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:06:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:06:39,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:06:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:06:40,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:06:41,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:06:41,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:06:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:06:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:06:43,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:06:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:06:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:06:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:06:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:06:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:06:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:06:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:06:49,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:06:49,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:06:50,469][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:06:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:06:51,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:06:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:06:53,102][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:06:53,761][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:06:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:06:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:06:55,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:06:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:06:57,057][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:06:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:06:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:06:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:06:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:07:00,348][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:07:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:07:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:07:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:07:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:07:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:07:04,610][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:07:05,271][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:07:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:07:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:07:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:07:07,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:07:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:07:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:07:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:07:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:07:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:07:11,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:07:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:07:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:07:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:07:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:07:15,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:07:15,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:07:17,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:07:17,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:07:17,109][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:07:18,571][__main__][INFO] - Iteration 320 took 52s (9.67% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 44m 22s. Estimated total time: 14h 30m 17s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 8s. [2026-03-25 19:07:18,575][__main__][INFO] - Starting iteration 320. [2026-03-25 19:07:18,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:07:18,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:07:23,616][__main__][INFO] - Number of regex retries in iteration 320: 0 [2026-03-25 19:07:23,618][__main__][INFO] - agents played in iteration 320 are Alice, Bob [2026-03-25 19:07:24,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:07:24,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:07:24,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:07:24,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:07:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:07:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:07:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:07:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:07:27,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:07:28,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:07:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:07:29,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:07:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:07:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:07:31,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:07:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:07:33,041][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:07:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:07:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:07:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:07:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:07:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:07:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:07:37,663][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:07:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:07:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:07:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:07:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:07:40,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:07:41,624][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:07:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:07:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:07:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:07:44,267][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:07:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:07:45,588][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:07:46,248][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:07:46,907][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:07:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:07:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:07:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:07:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:07:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:07:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:07:51,530][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:07:52,190][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:07:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:07:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:07:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:07:54,831][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:07:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:07:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:07:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:07:57,722][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:07:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:07:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:07:59,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:08:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:08:01,019][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:08:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:08:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:08:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:08:03,658][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:08:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:08:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:08:05,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:08:06,296][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:08:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:08:07,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:08:08,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:08:09,509][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:08:09,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:08:09,513][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:08:10,848][__main__][INFO] - Iteration 321 took 52s (9.64% Gen, 87.80% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 44m 23s. Estimated total time: 14h 31m 10s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2026-03-25 19:08:10,852][__main__][INFO] - Starting iteration 321. [2026-03-25 19:08:10,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:08:10,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:08:15,935][__main__][INFO] - Number of regex retries in iteration 321: 0 [2026-03-25 19:08:15,937][__main__][INFO] - agents played in iteration 321 are Alice, Bob [2026-03-25 19:08:16,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:08:16,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:08:16,650][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:08:16,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:08:17,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:08:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:08:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:08:19,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:08:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:08:20,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:08:21,232][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:08:21,891][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:08:22,551][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:08:23,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:08:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:08:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:08:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:08:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:08:26,572][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:08:27,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:08:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:08:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:08:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:08:29,868][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:08:30,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:08:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:08:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:08:32,504][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:08:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:08:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:08:34,482][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:08:35,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:08:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:08:36,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:08:37,118][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:08:37,777][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:08:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:08:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:08:39,755][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:08:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:08:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:08:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:08:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:08:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:08:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:08:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:08:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:08:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:08:46,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:08:47,005][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:08:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:08:48,324][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:08:49,263][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:08:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:08:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:08:51,239][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:08:51,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:08:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:08:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:08:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:08:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:08:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:08:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:08:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:08:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:08:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:08:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:08:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:08:59,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:09:00,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:09:01,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:09:01,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:09:01,701][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:09:02,958][__main__][INFO] - Iteration 322 took 52s (9.75% Gen, 87.84% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 40m 42s. Estimated total time: 14h 28m 21s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 10s. [2026-03-25 19:09:02,962][__main__][INFO] - Starting iteration 322. [2026-03-25 19:09:02,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:09:02,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:09:11,174][__main__][INFO] - Number of regex retries in iteration 322: 0 [2026-03-25 19:09:11,176][__main__][INFO] - agents played in iteration 322 are Alice, Bob [2026-03-25 19:09:11,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:09:11,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:09:11,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:09:11,925][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:09:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:09:13,186][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:09:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:09:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:09:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:09:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:09:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:09:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:09:17,801][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:09:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:09:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:09:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:09:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:09:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:09:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:09:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:09:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:09:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:09:24,390][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:09:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:09:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:09:26,369][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:09:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:09:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:09:28,347][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:09:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:09:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:09:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:09:30,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:09:31,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:09:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:09:32,963][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:09:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:09:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:09:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:09:35,600][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:09:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:09:36,918][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:09:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:09:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:09:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:09:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:09:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:09:40,875][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:09:41,535][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:09:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:09:42,853][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:09:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:09:44,441][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:09:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:09:45,762][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:09:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:09:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:09:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:09:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:09:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:09:49,719][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:09:50,374][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:09:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:09:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:09:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:09:53,015][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:09:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:09:54,334][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:09:54,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:09:55,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:09:57,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:09:57,087][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:09:57,089][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:09:58,458][__main__][INFO] - Iteration 323 took 55s (14.76% Gen, 82.78% Train). Generation: 8s, Training: 45s. Estimated remaining time: 10h 35m 56s. Estimated total time: 15h 24m 30s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 15s. [2026-03-25 19:09:58,461][__main__][INFO] - Starting iteration 323. [2026-03-25 19:09:58,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:09:58,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:10:03,377][__main__][INFO] - Number of regex retries in iteration 323: 0 [2026-03-25 19:10:03,378][__main__][INFO] - agents played in iteration 323 are Alice, Bob [2026-03-25 19:10:03,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:03,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:03,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:10:03,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:10:04,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:10:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:10:05,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:10:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:10:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:10:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:10:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:10:09,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:10:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:10:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:10:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:10:11,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:10:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:10:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:10:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:10:14,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:10:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:10:15,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:10:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:10:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:10:17,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:10:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:10:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:10:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:10:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:10:21,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:10:21,768][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:10:22,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:10:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:10:23,744][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:10:24,402][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:10:25,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:10:25,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:10:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:10:27,038][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:10:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:10:28,356][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:10:29,015][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:10:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:10:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:10:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:10:31,650][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:10:32,308][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:10:32,968][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:10:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:10:34,286][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:10:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:10:35,603][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:10:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:10:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:10:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:10:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:10:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:10:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:10:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:10:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:10:41,819][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:10:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:10:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:10:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:10:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:10:45,113][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:10:45,773][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:10:46,432][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:10:47,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:10:47,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:10:48,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:10:49,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:10:49,001][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:10:50,342][__main__][INFO] - Iteration 324 took 51s (9.47% Gen, 87.94% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 35m 12s. Estimated total time: 14h 24m 38s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 19s. [2026-03-25 19:10:50,345][__main__][INFO] - Starting iteration 324. [2026-03-25 19:10:50,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:10:50,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:10:55,198][__main__][INFO] - Number of regex retries in iteration 324: 0 [2026-03-25 19:10:55,199][__main__][INFO] - agents played in iteration 324 are Alice, Bob [2026-03-25 19:10:55,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:55,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:55,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:10:55,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:10:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:10:57,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:10:57,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:10:58,449][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:10:59,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:10:59,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:11:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:11:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:11:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:11:02,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:11:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:11:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:11:04,373][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:11:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:11:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:11:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:11:07,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:11:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:11:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:11:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:11:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:11:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:11:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:11:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:11:12,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:11:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:11:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:11:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:11:14,906][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:11:15,565][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:11:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:11:16,883][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:11:17,541][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:11:18,200][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:11:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:11:19,519][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:11:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:11:20,836][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:11:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:11:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:11:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:11:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:11:24,129][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:11:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:11:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:11:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:11:26,767][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:11:27,425][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:11:28,332][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:11:28,991][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:11:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:11:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:11:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:11:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:11:32,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:11:32,943][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:11:33,602][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:11:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:11:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:11:35,578][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:11:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:11:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:11:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:11:38,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:11:38,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:11:39,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:11:40,724][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:11:40,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:11:40,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:11:42,162][__main__][INFO] - Iteration 325 took 51s (9.36% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 33m 16s. Estimated total time: 14h 23m 34s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 47s. [2026-03-25 19:11:42,165][__main__][INFO] - Starting iteration 325. [2026-03-25 19:11:42,169][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:11:42,170][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:11:47,006][__main__][INFO] - Number of regex retries in iteration 325: 0 [2026-03-25 19:11:47,007][__main__][INFO] - agents played in iteration 325 are Alice, Bob [2026-03-25 19:11:47,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:11:47,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:11:47,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:11:47,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:11:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:11:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:11:49,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:11:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:11:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:11:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:11:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:11:52,913][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:11:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:11:54,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:11:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:11:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:11:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:11:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:11:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:11:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:11:58,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:11:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:12:00,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:12:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:12:01,468][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:12:02,126][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:12:02,784][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:12:03,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:12:04,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:12:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:12:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:12:06,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:12:06,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:12:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:12:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:12:08,703][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:12:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:12:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:12:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:12:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:12:11,994][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:12:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:12:13,309][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:12:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:12:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:12:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:12:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:12:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:12:17,254][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:12:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:12:18,569][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:12:19,227][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:12:20,142][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:12:20,800][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:12:21,458][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:12:22,116][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:12:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:12:23,432][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:12:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:12:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:12:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:12:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:12:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:12:27,384][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:12:28,043][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:12:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:12:29,359][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:12:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:12:30,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:12:31,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:12:32,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:12:32,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:12:32,556][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:12:34,016][__main__][INFO] - Iteration 326 took 51s (9.33% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 32m 59s. Estimated total time: 14h 24m 9s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2026-03-25 19:12:34,018][__main__][INFO] - Starting iteration 326. [2026-03-25 19:12:34,022][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:12:34,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:12:40,643][__main__][INFO] - Number of regex retries in iteration 326: 0 [2026-03-25 19:12:40,644][__main__][INFO] - agents played in iteration 326 are Alice, Bob [2026-03-25 19:12:41,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:41,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:41,263][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:12:41,263][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:12:41,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:12:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:12:43,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:12:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:12:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:12:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:12:45,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:12:46,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:12:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:12:47,797][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:12:48,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:12:49,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:12:49,779][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:12:50,439][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:12:51,099][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:12:51,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:12:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:12:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:12:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:12:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:12:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:12:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:12:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:12:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:12:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:12:58,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:12:59,018][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:12:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:13:00,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:13:00,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:13:01,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:13:02,311][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:13:02,969][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:13:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:13:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:13:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:13:05,604][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:13:06,262][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:13:06,920][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:13:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:13:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:13:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:13:09,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:13:10,210][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:13:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:13:11,526][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:13:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:13:12,841][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:13:13,806][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:13:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:13:15,126][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:13:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:13:16,444][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:13:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:13:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:13:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:13:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:13:19,741][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:13:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:13:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:13:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:13:22,380][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:13:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:13:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:13:24,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:13:25,232][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:13:26,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:13:26,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:13:26,396][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:13:27,760][__main__][INFO] - Iteration 327 took 53s (12.32% Gen, 85.14% Train). Generation: 6s, Training: 45s. Estimated remaining time: 10h 3m 36s. Estimated total time: 14h 55m 40s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 50s. [2026-03-25 19:13:27,763][__main__][INFO] - Starting iteration 327. [2026-03-25 19:13:27,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:13:27,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:13:32,701][__main__][INFO] - Number of regex retries in iteration 327: 0 [2026-03-25 19:13:32,702][__main__][INFO] - agents played in iteration 327 are Alice, Bob [2026-03-25 19:13:33,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:13:33,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:13:33,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:13:33,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:13:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:13:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:13:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:13:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:13:36,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:13:37,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:13:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:13:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:13:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:13:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:13:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:13:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:13:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:13:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:13:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:13:43,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:13:44,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:13:45,140][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:13:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:13:46,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:13:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:13:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:13:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:13:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:13:49,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:13:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:13:51,069][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:13:51,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:13:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:13:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:13:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:13:54,364][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:13:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:13:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:13:56,339][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:13:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:13:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:13:58,312][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:13:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:13:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:14:00,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:14:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:14:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:14:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:14:02,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:14:03,574][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:14:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:14:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:14:05,777][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:14:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:14:07,095][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:14:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:14:08,411][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:14:09,069][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:14:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:14:10,386][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:14:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:14:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:14:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:14:13,019][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:14:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:14:14,335][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:14:14,995][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:14:15,652][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:14:16,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:14:17,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:14:18,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:14:18,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:14:18,136][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:14:19,377][__main__][INFO] - Iteration 328 took 51s (9.56% Gen, 88.03% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 27m 16s. Estimated total time: 14h 20m 11s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 1s, 500 more iterations: 7h 10m 5s. [2026-03-25 19:14:19,379][__main__][INFO] - Starting iteration 328. [2026-03-25 19:14:19,383][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:14:19,384][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:14:24,208][__main__][INFO] - Number of regex retries in iteration 328: 0 [2026-03-25 19:14:24,210][__main__][INFO] - agents played in iteration 328 are Alice, Bob [2026-03-25 19:14:24,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:24,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:24,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:14:24,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:14:25,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:14:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:14:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:14:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:14:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:14:28,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:14:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:14:30,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:14:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:14:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:14:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:14:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:14:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:14:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:14:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:14:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:14:36,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:14:36,785][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:14:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:14:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:14:38,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:14:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:14:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:14:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:14:41,391][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:14:42,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:14:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:14:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:14:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:14:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:14:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:14:45,994][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:14:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:14:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:14:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:14:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:14:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:14:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:14:50,597][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:14:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:14:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:14:52,569][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:14:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:14:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:14:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:14:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:14:55,857][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:14:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:14:57,496][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:14:58,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:14:58,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:14:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:15:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:15:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:15:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:15:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:15:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:15:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:15:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:15:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:15:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:15:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:15:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:15:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:15:08,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:15:08,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:15:09,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:15:09,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:15:09,981][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:15:11,502][__main__][INFO] - Iteration 329 took 52s (9.26% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 34m 53s. Estimated total time: 14h 28m 41s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 20s. [2026-03-25 19:15:11,506][__main__][INFO] - Starting iteration 329. [2026-03-25 19:15:11,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:15:11,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:15:16,407][__main__][INFO] - Number of regex retries in iteration 329: 0 [2026-03-25 19:15:16,408][__main__][INFO] - agents played in iteration 329 are Alice, Bob [2026-03-25 19:15:16,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:15:17,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:15:17,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:15:17,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:15:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:15:18,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:15:18,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:15:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:15:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:15:20,949][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:15:21,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:15:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:15:22,924][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:15:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:15:24,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:15:24,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:15:25,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:15:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:15:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:15:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:15:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:15:28,850][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:15:29,507][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:15:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:15:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:15:31,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:15:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:15:32,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:15:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:15:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:15:34,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:15:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:15:36,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:15:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:15:37,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:15:38,070][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:15:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:15:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:15:40,042][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:15:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:15:41,360][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:15:42,018][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:15:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:15:43,333][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:15:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:15:44,648][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:15:45,306][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:15:45,964][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:15:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:15:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:15:47,937][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:15:48,595][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:15:49,498][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:15:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:15:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:15:51,472][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:15:52,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:15:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:15:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:15:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:15:54,761][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:15:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:15:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:15:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:15:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:15:58,054][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:15:58,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:15:59,370][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:16:00,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:16:00,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:16:01,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:16:01,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:16:01,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:16:03,217][__main__][INFO] - Iteration 330 took 51s (9.46% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 27m 7s. Estimated total time: 14h 21m 46s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 53s. [2026-03-25 19:16:03,227][__main__][INFO] - Starting iteration 330. [2026-03-25 19:16:03,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:16:03,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:16:08,276][__main__][INFO] - Number of regex retries in iteration 330: 0 [2026-03-25 19:16:08,277][__main__][INFO] - agents played in iteration 330 are Alice, Bob [2026-03-25 19:16:08,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:16:08,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:16:08,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:16:08,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:16:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:16:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:16:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:16:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:16:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:16:12,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:16:13,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:16:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:16:14,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:16:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:16:16,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:16:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:16:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:16:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:16:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:16:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:16:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:16:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:16:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:16:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:16:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:16:23,470][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:16:24,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:16:24,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:16:25,442][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:16:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:16:26,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:16:27,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:16:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:16:28,729][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:16:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:16:30,045][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:16:30,704][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:16:31,362][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:16:32,020][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:16:32,677][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:16:33,334][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:16:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:16:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:16:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:16:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:16:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:16:37,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:16:37,937][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:16:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:16:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:16:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:16:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:16:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:16:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:16:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:16:43,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:16:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:16:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:16:45,497][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:16:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:16:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:16:47,475][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:16:48,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:16:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:16:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:16:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:16:50,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:16:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:16:52,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:16:52,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:16:54,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:16:54,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:16:54,060][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:16:55,593][__main__][INFO] - Iteration 331 took 52s (9.62% Gen, 87.44% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 37m 5s. Estimated total time: 14h 32m 36s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 18s. [2026-03-25 19:16:55,595][__main__][INFO] - Starting iteration 331. [2026-03-25 19:16:55,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:16:55,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:17:00,636][__main__][INFO] - Number of regex retries in iteration 331: 0 [2026-03-25 19:17:00,638][__main__][INFO] - agents played in iteration 331 are Alice, Bob [2026-03-25 19:17:01,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:01,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:01,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:17:01,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:17:01,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:17:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:17:03,176][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:17:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:17:04,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:17:05,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:17:05,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:17:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:17:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:17:07,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:17:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:17:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:17:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:17:10,416][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:17:11,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:17:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:17:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:17:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:17:13,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:17:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:17:15,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:17:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:17:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:17:16,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:17:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:17:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:17:18,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:17:19,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:17:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:17:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:17:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:17:22,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:17:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:17:23,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:17:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:17:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:17:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:17:26,200][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:17:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:17:27,518][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:17:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:17:28,836][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:17:29,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:17:30,152][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:17:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:17:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:17:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:17:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:17:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:17:34,434][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:17:35,092][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:17:35,750][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:17:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:17:37,070][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:17:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:17:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:17:39,046][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:17:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:17:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:17:41,021][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:17:41,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:17:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:17:42,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:17:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:17:44,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:17:45,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:17:46,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:17:46,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:17:46,194][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:17:47,516][__main__][INFO] - Iteration 332 took 51s (9.71% Gen, 87.74% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 28m 56s. Estimated total time: 14h 25m 19s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 39s. [2026-03-25 19:17:47,519][__main__][INFO] - Starting iteration 332. [2026-03-25 19:17:47,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:17:47,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:17:52,582][__main__][INFO] - Number of regex retries in iteration 332: 0 [2026-03-25 19:17:52,584][__main__][INFO] - agents played in iteration 332 are Alice, Bob [2026-03-25 19:17:53,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:53,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:53,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:17:53,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:17:53,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:17:54,502][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:17:55,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:17:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:17:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:17:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:17:57,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:17:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:17:59,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:17:59,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:18:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:18:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:18:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:18:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:18:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:18:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:18:04,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:18:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:18:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:18:06,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:18:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:18:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:18:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:18:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:18:09,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:18:10,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:18:10,959][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:18:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:18:12,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:18:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:18:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:18:14,252][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:18:14,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:18:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:18:16,228][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:18:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:18:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:18:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:18:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:18:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:18:20,186][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:18:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:18:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:18:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:18:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:18:23,478][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:18:24,137][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:18:24,796][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:18:25,716][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:18:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:18:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:18:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:18:28,353][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:18:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:18:29,670][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:18:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:18:30,987][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:18:31,647][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:18:32,305][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:18:32,964][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:18:33,622][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:18:34,280][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:18:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:18:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:18:36,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:18:36,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:18:38,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:18:38,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:18:38,131][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:18:39,497][__main__][INFO] - Iteration 333 took 51s (9.74% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 29m 0s. Estimated total time: 14h 26m 16s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 8s. [2026-03-25 19:18:39,502][__main__][INFO] - Starting iteration 333. [2026-03-25 19:18:39,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:18:39,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:18:44,384][__main__][INFO] - Number of regex retries in iteration 333: 0 [2026-03-25 19:18:44,385][__main__][INFO] - agents played in iteration 333 are Alice, Bob [2026-03-25 19:18:44,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:18:45,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:18:45,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:18:45,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:18:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:18:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:18:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:18:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:18:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:18:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:18:49,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:18:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:18:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:18:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:18:52,299][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:18:52,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:18:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:18:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:18:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:18:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:18:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:18:56,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:18:57,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:18:58,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:18:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:18:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:19:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:19:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:19:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:19:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:19:02,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:19:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:19:04,183][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:19:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:19:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:19:06,165][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:19:06,826][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:19:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:19:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:19:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:19:09,465][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:19:10,125][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:19:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:19:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:19:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:19:12,766][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:19:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:19:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:19:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:19:15,405][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:19:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:19:16,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:19:17,646][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:19:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:19:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:19:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:19:20,283][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:19:20,943][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:19:21,602][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:19:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:19:22,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:19:23,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:19:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:19:24,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:19:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:19:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:19:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:19:27,538][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:19:28,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:19:28,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:19:30,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:19:30,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:19:30,066][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:19:31,336][__main__][INFO] - Iteration 334 took 51s (9.41% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 25m 45s. Estimated total time: 14h 23m 52s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 56s. [2026-03-25 19:19:31,339][__main__][INFO] - Starting iteration 334. [2026-03-25 19:19:31,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:19:31,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:19:36,688][__main__][INFO] - Number of regex retries in iteration 334: 0 [2026-03-25 19:19:36,690][__main__][INFO] - agents played in iteration 334 are Alice, Bob [2026-03-25 19:19:37,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:19:37,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:19:37,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:19:37,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:19:38,072][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:19:38,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:19:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:19:40,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:19:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:19:41,326][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:19:41,987][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:19:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:19:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:19:43,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:19:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:19:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:19:45,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:19:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:19:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:19:47,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:19:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:19:49,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:19:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:19:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:19:51,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:19:51,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:19:52,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:19:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:19:53,876][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:19:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:19:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:19:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:19:56,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:19:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:19:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:19:58,495][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:19:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:19:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:20:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:20:01,134][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:20:01,793][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:20:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:20:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:20:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:20:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:20:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:20:05,753][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:20:06,412][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:20:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:20:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:20:08,393][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:20:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:20:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:20:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:20:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:20:12,024][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:20:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:20:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:20:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:20:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:20:15,321][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:20:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:20:16,640][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:20:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:20:17,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:20:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:20:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:20:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:20:20,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:20:21,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:20:22,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:20:22,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:20:22,478][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:20:24,098][__main__][INFO] - Iteration 335 took 52s (10.13% Gen, 86.79% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 40m 16s. Estimated total time: 14h 39m 17s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 38s. [2026-03-25 19:20:24,101][__main__][INFO] - Starting iteration 335. [2026-03-25 19:20:24,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:20:24,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:20:29,249][__main__][INFO] - Number of regex retries in iteration 335: 0 [2026-03-25 19:20:29,251][__main__][INFO] - agents played in iteration 335 are Alice, Bob [2026-03-25 19:20:29,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:20:29,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:20:29,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:20:29,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:20:30,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:20:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:20:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:20:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:20:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:20:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:20:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:20:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:20:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:20:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:20:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:20:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:20:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:20:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:20:39,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:20:40,323][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:20:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:20:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:20:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:20:42,966][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:20:43,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:20:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:20:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:20:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:20:46,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:20:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:20:47,579][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:20:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:20:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:20:49,557][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:20:50,215][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:20:50,874][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:20:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:20:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:20:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:20:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:20:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:20:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:20:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:20:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:20:56,806][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:20:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:20:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:20:58,782][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:20:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:21:00,099][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:21:00,757][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:21:01,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:21:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:21:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:21:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:21:04,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:21:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:21:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:21:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:21:06,989][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:21:07,648][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:21:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:21:08,966][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:21:09,624][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:21:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:21:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:21:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:21:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:21:12,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:21:13,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:21:14,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:21:14,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:21:14,770][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:21:16,272][__main__][INFO] - Iteration 336 took 52s (9.86% Gen, 87.25% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 29m 36s. Estimated total time: 14h 29m 29s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 44s. [2026-03-25 19:21:16,275][__main__][INFO] - Starting iteration 336. [2026-03-25 19:21:16,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:21:16,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:21:18,160][mllm.models.large_language_model_local][WARNING] - Response ngle quote instead of did not match regex: (|), retry 1/1 [2026-03-25 19:21:23,774][__main__][INFO] - Number of regex retries in iteration 336: 1 [2026-03-25 19:21:23,775][__main__][INFO] - agents played in iteration 336 are Alice, Bob [2026-03-25 19:21:24,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:21:24,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:21:24,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:21:24,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 19:21:25,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:21:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:21:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:21:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:21:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:21:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:21:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:21:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:21:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:21:30,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:21:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 19:21:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:21:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:21:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:21:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:21:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:21:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:21:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:21:36,904][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:21:37,563][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:21:38,222][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 19:21:38,880][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:21:39,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:21:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:21:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:21:41,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:21:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:21:42,833][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:21:43,492][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:21:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:21:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:21:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 19:21:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:21:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:21:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:21:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:21:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:21:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:21:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:21:50,745][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:21:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
19:21:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:21:52,722][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:21:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:21:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:21:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:21:55,358][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:21:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:21:56,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:21:57,590][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:21:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:21:58,906][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 19:21:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:22:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:22:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:22:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:22:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:22:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:22:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:22:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:22:04,827][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:22:05,485][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 19:22:06,145][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:22:06,803][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:22:07,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 19:22:08,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:22:09,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:22:09,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:22:09,455][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:22:10,850][__main__][INFO] - Iteration 337 took 54s (13.74% Gen, 83.70% Train). Generation: 7s, Training: 45s. Estimated remaining time: 10h 8m 46s. Estimated total time: 15h 9m 33s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 57s, 500 more iterations: 7h 34m 46s. [2026-03-25 19:22:10,853][__main__][INFO] - Starting iteration 337. [2026-03-25 19:22:10,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:22:10,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:22:15,816][__main__][INFO] - Number of regex retries in iteration 337: 0 [2026-03-25 19:22:15,818][__main__][INFO] - agents played in iteration 337 are Alice, Bob [2026-03-25 19:22:16,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:22:16,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:22:16,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:22:16,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:22:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:22:17,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:22:18,389][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:22:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:22:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:22:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:22:21,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:22:21,674][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:22:22,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:22:22,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:22:23,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:22:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:22:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:22:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:22:26,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:22:26,940][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:22:27,598][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:22:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:22:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:22:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:22:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:22:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:22:31,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:22:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:22:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:22:33,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:22:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:22:34,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:22:35,488][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:22:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:22:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:22:37,461][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:22:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:22:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:22:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:22:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:22:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:22:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:22:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:22:42,729][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:22:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:22:44,045][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:22:44,703][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:22:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:22:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:22:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:22:47,334][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:22:47,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:22:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:22:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:22:50,232][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:22:50,890][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:22:51,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:22:52,206][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:22:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:22:53,521][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:22:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:22:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:22:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:22:56,155][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:22:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:22:57,470][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:22:58,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:22:58,784][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:22:59,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:23:00,211][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:23:01,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:23:01,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:23:01,355][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:23:02,541][__main__][INFO] - Iteration 338 took 51s (9.60% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 19m 48s. Estimated total time: 14h 21m 26s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 8s, 500 more iterations: 7h 10m 43s. [2026-03-25 19:23:02,544][__main__][INFO] - Starting iteration 338. [2026-03-25 19:23:02,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:23:02,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:23:07,537][__main__][INFO] - Number of regex retries in iteration 338: 0 [2026-03-25 19:23:07,538][__main__][INFO] - agents played in iteration 338 are Alice, Bob [2026-03-25 19:23:08,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:08,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:08,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:23:08,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:23:08,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:23:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:23:10,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:23:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:23:11,363][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:23:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:23:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:23:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:23:13,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:23:14,649][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:23:15,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:23:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:23:16,623][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:23:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:23:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:23:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:23:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:23:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:23:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:23:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:23:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:23:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:23:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:23:23,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:23:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:23:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:23:25,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:23:26,485][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:23:27,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:23:27,800][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:23:28,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:23:29,114][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:23:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:23:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:23:31,086][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:23:31,743][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:23:32,401][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:23:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:23:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:23:34,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:23:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:23:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:23:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:23:37,001][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:23:37,659][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:23:38,316][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:23:38,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:23:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:23:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:23:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:23:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:23:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:23:43,174][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:23:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:23:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:23:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:23:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:23:46,462][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:23:47,121][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:23:47,780][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:23:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:23:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:23:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:23:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:23:51,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:23:51,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:23:52,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:23:52,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:23:52,967][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:23:54,349][__main__][INFO] - Iteration 339 took 51s (9.63% Gen, 87.70% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 20m 52s. Estimated total time: 14h 23m 22s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 20s, 500 more iterations: 7h 11m 41s. [2026-03-25 19:23:54,351][__main__][INFO] - Starting iteration 339. [2026-03-25 19:23:54,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:23:54,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:23:59,412][__main__][INFO] - Number of regex retries in iteration 339: 0 [2026-03-25 19:23:59,414][__main__][INFO] - agents played in iteration 339 are Alice, Bob [2026-03-25 19:24:00,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:00,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:00,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:24:00,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:24:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:24:01,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:24:02,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:24:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:24:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:24:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:24:04,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:24:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:24:05,962][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:24:06,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:24:07,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:24:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:24:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:24:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:24:09,908][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:24:10,567][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:24:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:24:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:24:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:24:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:24:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:24:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:24:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:24:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:24:16,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:24:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:24:17,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:24:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:24:19,126][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:24:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:24:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:24:21,101][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:24:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:24:22,418][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:24:23,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:24:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:24:24,392][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:24:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:24:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:24:26,368][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:24:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:24:27,685][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:24:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:24:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:24:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:24:30,322][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:24:30,981][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:24:31,641][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:24:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:24:33,227][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:24:33,885][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:24:34,544][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:24:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:24:35,861][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:24:36,519][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:24:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:24:37,836][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:24:38,494][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:24:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:24:39,810][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:24:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:24:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:24:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:24:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:24:43,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:24:43,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:24:44,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:24:44,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:24:44,930][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:24:46,253][__main__][INFO] - Iteration 340 took 51s (9.75% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 21m 37s. Estimated total time: 14h 24m 59s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 29s. [2026-03-25 19:24:46,255][__main__][INFO] - Starting iteration 340. [2026-03-25 19:24:46,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:24:46,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:24:51,228][__main__][INFO] - Number of regex retries in iteration 340: 0 [2026-03-25 19:24:51,229][__main__][INFO] - agents played in iteration 340 are Alice, Bob [2026-03-25 19:24:51,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:51,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:51,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:24:51,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:24:52,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:24:53,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:24:53,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:24:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:24:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:24:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:24:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:24:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:24:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:24:58,452][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:24:59,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:24:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:25:00,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:25:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:25:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:25:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:25:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:25:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:25:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:25:05,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:25:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:25:06,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:25:07,035][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:25:07,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:25:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:25:09,018][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:25:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:25:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:25:10,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:25:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:25:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:25:12,980][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:25:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:25:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:25:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:25:15,622][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:25:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:25:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:25:17,607][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:25:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:25:18,926][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:25:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:25:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:25:20,906][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:25:21,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:25:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:25:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:25:23,544][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:25:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:25:25,146][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:25:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:25:26,464][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:25:27,124][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:25:27,784][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:25:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:25:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:25:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:25:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:25:31,085][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:25:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:25:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:25:33,071][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:25:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:25:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:25:35,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:25:35,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:25:37,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:25:37,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:25:37,047][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:25:38,359][__main__][INFO] - Iteration 341 took 52s (9.54% Gen, 87.94% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 24m 8s. Estimated total time: 14h 28m 22s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 11s. [2026-03-25 19:25:38,361][__main__][INFO] - Starting iteration 341. [2026-03-25 19:25:38,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:25:38,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:25:43,429][__main__][INFO] - Number of regex retries in iteration 341: 0 [2026-03-25 19:25:43,431][__main__][INFO] - agents played in iteration 341 are Alice, Bob [2026-03-25 19:25:43,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:25:44,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:25:44,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:25:44,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:25:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:25:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:25:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:25:46,690][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:25:47,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:25:48,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:25:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:25:49,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:25:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:25:50,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:25:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:25:51,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:25:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:25:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:25:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:25:54,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:25:55,263][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:25:55,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:25:56,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:25:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:25:57,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:25:58,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:25:59,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:25:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:26:00,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:26:01,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:26:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:26:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:26:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:26:03,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:26:04,497][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:26:05,156][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:26:05,816][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:26:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:26:07,133][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:26:07,793][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:26:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:26:09,111][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:26:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:26:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:26:11,088][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:26:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:26:12,407][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:26:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:26:13,725][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:26:14,383][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:26:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:26:15,702][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:26:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:26:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:26:17,984][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:26:18,642][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:26:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:26:19,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:26:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:26:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:26:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:26:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:26:23,250][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:26:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:26:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:26:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:26:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:26:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:26:27,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:26:27,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:26:29,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:26:29,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:26:29,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:26:30,400][__main__][INFO] - Iteration 342 took 52s (9.72% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 22m 2s. Estimated total time: 14h 27m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 34s. [2026-03-25 19:26:30,403][__main__][INFO] - Starting iteration 342. [2026-03-25 19:26:30,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:26:30,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:26:35,527][__main__][INFO] - Number of regex retries in iteration 342: 0 [2026-03-25 19:26:35,529][__main__][INFO] - agents played in iteration 342 are Alice, Bob [2026-03-25 19:26:36,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:26:36,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:26:36,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:26:36,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:26:36,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:26:37,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:26:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:26:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:26:39,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:26:40,039][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:26:40,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:26:41,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:26:42,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:26:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:26:43,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:26:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:26:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:26:45,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:26:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:26:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:26:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:26:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:26:48,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:26:49,253][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:26:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:26:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:26:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:26:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:26:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:26:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:26:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:26:54,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:26:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:26:55,836][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:26:56,495][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:26:57,153][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:26:57,811][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:26:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:26:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:26:59,786][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:27:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:27:01,102][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:27:01,761][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:27:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:27:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:27:03,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:27:04,393][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:27:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:27:05,708][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:27:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:27:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:27:07,683][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:27:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:27:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:27:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:27:10,640][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:27:11,299][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:27:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:27:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:27:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:27:13,932][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:27:14,590][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:27:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:27:15,906][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:27:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:27:17,223][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:27:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:27:18,539][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:27:19,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:27:19,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:27:21,117][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:27:21,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:27:21,121][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:27:22,487][__main__][INFO] - Iteration 343 took 52s (9.83% Gen, 87.54% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 22m 3s. Estimated total time: 14h 28m 2s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 1s. [2026-03-25 19:27:22,489][__main__][INFO] - Starting iteration 343. [2026-03-25 19:27:22,495][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:27:22,495][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:27:28,917][__main__][INFO] - Number of regex retries in iteration 343: 0 [2026-03-25 19:27:28,918][__main__][INFO] - agents played in iteration 343 are Alice, Bob [2026-03-25 19:27:29,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:27:29,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:27:29,588][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:27:29,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:27:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:27:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:27:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:27:32,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:27:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:27:33,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:27:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:27:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:27:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:27:36,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:27:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:27:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:27:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:27:38,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:27:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:27:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:27:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:27:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:27:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:27:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:27:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:27:44,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:27:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:27:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:27:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:27:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:27:47,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:27:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:27:48,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:27:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:27:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:27:50,605][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:27:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:27:51,921][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:27:52,578][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:27:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:27:53,894][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:27:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:27:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:27:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:27:56,524][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:27:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:27:57,841][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:27:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:27:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:27:59,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:28:00,471][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:28:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:28:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:28:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:28:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:28:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:28:04,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:28:05,326][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:28:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:28:06,643][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:28:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:28:07,962][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:28:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:28:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:28:09,937][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:28:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:28:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:28:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:28:12,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:28:13,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:28:14,527][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:28:14,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:28:14,532][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:28:16,691][__main__][INFO] - Iteration 344 took 54s (11.85% Gen, 84.16% Train). Generation: 6s, Training: 45s. Estimated remaining time: 9h 56m 25s. Estimated total time: 15h 3m 18s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 19s, 500 more iterations: 7h 31m 39s. [2026-03-25 19:28:16,694][__main__][INFO] - Starting iteration 344. [2026-03-25 19:28:16,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:28:16,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:28:21,774][__main__][INFO] - Number of regex retries in iteration 344: 0 [2026-03-25 19:28:21,775][__main__][INFO] - agents played in iteration 344 are Alice, Bob [2026-03-25 19:28:22,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:28:22,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:28:22,407][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:28:22,408][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:28:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:28:23,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:28:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:28:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:28:25,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:28:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:28:26,993][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:28:27,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:28:28,309][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:28:28,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:28:29,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:28:30,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:28:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:28:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:28:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:28:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:28:33,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:28:34,238][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:28:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:28:35,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:28:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:28:36,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:28:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:28:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:28:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:28:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:28:40,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:28:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:28:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:28:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:28:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:28:43,457][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:28:44,115][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:28:44,774][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:28:45,433][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:28:46,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:28:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:28:47,408][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:28:48,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:28:48,725][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:28:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:28:50,042][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:28:50,700][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:28:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:28:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:28:52,680][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:28:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:28:53,999][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:28:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:28:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:28:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:28:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:28:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:28:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:28:58,958][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:28:59,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:29:00,276][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:29:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:29:01,595][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:29:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:29:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:29:03,572][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:29:04,230][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:29:04,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:29:05,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:29:06,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:29:07,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:29:07,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:29:07,470][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:29:08,784][__main__][INFO] - Iteration 345 took 52s (9.74% Gen, 87.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 20m 22s. Estimated total time: 14h 28m 7s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 3s. [2026-03-25 19:29:08,786][__main__][INFO] - Starting iteration 345. [2026-03-25 19:29:08,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:29:08,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:29:13,943][__main__][INFO] - Number of regex retries in iteration 345: 0 [2026-03-25 19:29:13,945][__main__][INFO] - agents played in iteration 345 are Alice, Bob [2026-03-25 19:29:14,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:29:14,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:29:14,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:29:14,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:29:15,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:29:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:29:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:29:17,262][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:29:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:29:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:29:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:29:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:29:20,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:29:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:29:21,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:29:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:29:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:29:23,847][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:29:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:29:25,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:29:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:29:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:29:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:29:27,801][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:29:28,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:29:29,118][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:29:29,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:29:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:29:31,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:29:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:29:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:29:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:29:33,727][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:29:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:29:35,043][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:29:35,702][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:29:36,361][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:29:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:29:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:29:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:29:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:29:39,653][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:29:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:29:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:29:41,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:29:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:29:42,948][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:29:43,606][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:29:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:29:44,924][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:29:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:29:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:29:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:29:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:29:48,511][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:29:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:29:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:29:50,486][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:29:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:29:51,803][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:29:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:29:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:29:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:29:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:29:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:29:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:29:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:29:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:29:57,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:29:58,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:29:59,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:29:59,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:29:59,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:30:01,163][__main__][INFO] - Iteration 346 took 52s (9.84% Gen, 87.26% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 24m 17s. Estimated total time: 14h 32m 54s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 27s. [2026-03-25 19:30:01,167][__main__][INFO] - Starting iteration 346. [2026-03-25 19:30:01,172][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:30:01,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:30:06,084][__main__][INFO] - Number of regex retries in iteration 346: 0 [2026-03-25 19:30:06,085][__main__][INFO] - agents played in iteration 346 are Alice, Bob [2026-03-25 19:30:06,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:06,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:06,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:30:06,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:30:07,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:30:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:30:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:30:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:30:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:30:10,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:30:11,368][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:30:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:30:12,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:30:13,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:30:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:30:14,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:30:15,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:30:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:30:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:30:17,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:30:17,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:30:18,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:30:19,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:30:19,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:30:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:30:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:30:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:30:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:30:23,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:30:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:30:24,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:30:25,196][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:30:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:30:26,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:30:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:30:27,832][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:30:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:30:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:30:29,805][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:30:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:30:31,120][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:30:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:30:32,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:30:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:30:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:30:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:30:35,065][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:30:35,723][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:30:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:30:37,039][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:30:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:30:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:30:39,337][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:30:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:30:40,652][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:30:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:30:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:30:42,626][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:30:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:30:43,942][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:30:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:30:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:30:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:30:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:30:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:30:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:30:48,547][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:30:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:30:49,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:30:50,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:30:51,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:30:51,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:30:51,777][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:30:53,259][__main__][INFO] - Iteration 347 took 52s (9.43% Gen, 87.72% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 18m 40s. Estimated total time: 14h 28m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 4s. [2026-03-25 19:30:53,263][__main__][INFO] - Starting iteration 347. [2026-03-25 19:30:53,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:30:53,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:30:58,358][__main__][INFO] - Number of regex retries in iteration 347: 0 [2026-03-25 19:30:58,359][__main__][INFO] - agents played in iteration 347 are Alice, Bob [2026-03-25 19:30:59,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:59,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:59,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:30:59,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:30:59,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:31:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:31:02,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:31:03,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:31:03,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:31:04,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:31:05,016][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:31:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:31:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:31:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:31:07,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:31:08,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:31:08,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:31:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:31:10,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:31:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:31:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:31:12,256][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:31:12,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:31:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:31:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:31:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:31:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:31:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:31:16,867][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:31:17,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:31:18,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:31:18,842][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:31:19,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:31:20,157][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:31:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:31:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:31:22,133][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:31:22,791][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:31:23,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:31:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:31:24,767][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:31:25,425][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:31:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:31:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:31:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:31:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:31:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:31:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:31:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:31:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:31:31,658][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:31:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:31:32,977][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:31:33,635][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:31:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:31:34,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:31:35,611][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:31:36,269][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:31:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:31:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:31:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:31:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:31:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:31:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:31:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:31:41,537][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:31:42,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:31:42,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:31:44,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:31:44,040][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:31:44,042][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:31:45,493][__main__][INFO] - Iteration 348 took 52s (9.75% Gen, 87.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 20m 6s. Estimated total time: 14h 30m 27s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 13s. [2026-03-25 19:31:45,496][__main__][INFO] - Starting iteration 348. [2026-03-25 19:31:45,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:31:45,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:31:50,777][__main__][INFO] - Number of regex retries in iteration 348: 0 [2026-03-25 19:31:50,778][__main__][INFO] - agents played in iteration 348 are Alice, Bob [2026-03-25 19:31:51,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:51,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:51,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:31:51,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:31:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:52,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:31:54,044][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:31:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:31:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:31:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:31:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:31:57,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:31:58,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:31:58,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:31:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:31:59,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:32:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:32:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:32:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:32:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:32:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:32:03,920][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:32:04,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:32:05,236][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:32:05,893][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:32:06,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:32:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:32:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:32:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:32:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:32:09,840][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:32:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:32:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:32:11,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:32:12,470][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:32:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:32:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:32:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:32:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:32:15,760][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:32:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:32:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:32:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:32:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:32:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:32:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:32:20,367][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:32:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:32:21,684][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:32:22,343][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:32:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:32:23,909][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:32:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:32:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:32:25,885][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:32:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:32:27,201][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:32:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:32:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:32:29,176][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:32:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:32:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:32:31,151][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:32:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:32:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:32:33,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:32:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:32:34,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:32:35,186][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:32:36,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:32:36,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:32:36,304][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:32:37,819][__main__][INFO] - Iteration 349 took 52s (10.09% Gen, 87.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 20m 47s. Estimated total time: 14h 32m 1s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 0s. [2026-03-25 19:32:37,824][__main__][INFO] - Starting iteration 349. [2026-03-25 19:32:37,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:32:37,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:32:42,900][__main__][INFO] - Number of regex retries in iteration 349: 0 [2026-03-25 19:32:42,901][__main__][INFO] - agents played in iteration 349 are Alice, Bob [2026-03-25 19:32:43,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:32:43,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:32:43,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:32:43,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:32:44,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:32:44,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:32:45,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:32:46,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:32:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:32:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:32:48,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:32:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:32:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:32:50,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:32:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:32:51,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:32:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:32:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:32:53,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:32:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:32:54,721][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:32:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:32:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:32:56,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:32:57,355][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:32:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:32:58,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:32:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:32:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:33:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:33:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:33:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:33:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:33:03,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:33:03,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:33:04,592][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:33:05,249][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:33:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:33:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:33:07,224][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:33:07,881][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:33:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:33:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:33:09,856][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:33:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:33:11,172][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:33:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:33:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:33:13,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:33:13,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:33:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:33:15,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:33:16,033][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:33:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:33:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:33:18,008][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:33:18,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:33:19,324][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:33:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:33:20,640][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:33:21,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:33:21,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:33:22,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:33:23,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:33:23,929][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:33:24,588][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:33:25,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:33:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:33:26,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:33:27,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:33:28,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:33:28,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:33:28,457][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:33:29,837][__main__][INFO] - Iteration 350 took 52s (9.75% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 14m 42s. Estimated total time: 14h 26m 48s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 24s. [2026-03-25 19:33:29,839][__main__][INFO] - Starting iteration 350. [2026-03-25 19:33:29,843][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:33:29,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:33:35,837][__main__][INFO] - Number of regex retries in iteration 350: 0 [2026-03-25 19:33:35,838][__main__][INFO] - agents played in iteration 350 are Alice, Bob [2026-03-25 19:33:36,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:33:36,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:33:36,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:33:36,387][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:33:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:33:37,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:33:38,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:33:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:33:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:33:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:33:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:33:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:33:42,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:33:43,090][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:33:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:33:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:33:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:33:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:33:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:33:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:33:47,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:33:48,357][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:33:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:33:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:33:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:33:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:33:51,645][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:33:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:33:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:33:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:33:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:33:54,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:33:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:33:56,259][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:33:56,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:33:57,576][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:33:58,233][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:33:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:33:59,549][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:34:00,207][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:34:00,865][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:34:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:34:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:34:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:34:03,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:34:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:34:04,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:34:05,469][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:34:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:34:06,786][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:34:07,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:34:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:34:09,089][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:34:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:34:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:34:11,064][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:34:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:34:12,379][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:34:13,037][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:34:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:34:14,354][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:34:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:34:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:34:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:34:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:34:17,645][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:34:18,303][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:34:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:34:19,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:34:20,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:34:21,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:34:21,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:34:21,517][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:34:26,137][__main__][INFO] - Iteration 351 took 56s (10.65% Gen, 81.14% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 25m 14s. Estimated total time: 15h 38m 16s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 49s, 500 more iterations: 7h 49m 8s. [2026-03-25 19:34:26,140][__main__][INFO] - Starting iteration 351. [2026-03-25 19:34:26,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:34:26,145][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:34:31,341][__main__][INFO] - Number of regex retries in iteration 351: 0 [2026-03-25 19:34:31,342][__main__][INFO] - agents played in iteration 351 are Alice, Bob [2026-03-25 19:34:31,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:34:32,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:34:32,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:34:32,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:34:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:34:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:34:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:34:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:34:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:34:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:34:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:34:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:34:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:34:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:34:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:34:39,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:34:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:34:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:34:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:34:42,499][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:34:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:34:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:34:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:34:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:34:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:34:46,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:34:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:34:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:34:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:34:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:34:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:34:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:34:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:34:51,707][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:34:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:34:53,021][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:34:53,678][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:34:54,336][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:34:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:34:55,651][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:34:56,307][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:34:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:34:57,621][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:34:58,278][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:34:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:34:59,592][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:35:00,250][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:35:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:35:01,564][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:35:02,221][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:35:02,879][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:35:03,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:35:04,492][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:35:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:35:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:35:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:35:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:35:07,778][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:35:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:35:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:35:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:35:10,408][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:35:11,066][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:35:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:35:12,380][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:35:13,037][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:35:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:35:14,352][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:35:15,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:35:15,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:35:16,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:35:16,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:35:16,839][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:35:25,217][__main__][INFO] - Iteration 352 took 59s (8.80% Gen, 77.02% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 10m 33s. Estimated total time: 16h 24m 34s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 27s, 500 more iterations: 8h 12m 17s. [2026-03-25 19:35:25,220][__main__][INFO] - Starting iteration 352. [2026-03-25 19:35:25,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:35:25,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:35:30,310][__main__][INFO] - Number of regex retries in iteration 352: 0 [2026-03-25 19:35:30,312][__main__][INFO] - agents played in iteration 352 are Alice, Bob [2026-03-25 19:35:30,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:35:30,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:35:30,862][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:35:30,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:35:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:35:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:35:32,791][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:35:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:35:34,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:35:34,760][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:35:35,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:35:36,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:35:36,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:35:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:35:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:35:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:35:39,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:35:40,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:35:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:35:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:35:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:35:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:35:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:35:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:35:44,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:35:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:35:45,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:35:46,609][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:35:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:35:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:35:48,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:35:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:35:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:35:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:35:51,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:35:51,881][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:35:52,539][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:35:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:35:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:35:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:35:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:35:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:35:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:35:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:35:57,809][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:35:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:35:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:35:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:36:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:36:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:36:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:36:02,417][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:36:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:36:03,972][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:36:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:36:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:36:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:36:06,605][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:36:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:36:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:36:08,584][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:36:09,243][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:36:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:36:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:36:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:36:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:36:12,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:36:13,193][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:36:13,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:36:14,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:36:15,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:36:15,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:36:15,715][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:36:17,186][__main__][INFO] - Iteration 353 took 51s (9.79% Gen, 87.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 11m 10s. Estimated total time: 14h 26m 3s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 1s. [2026-03-25 19:36:17,188][__main__][INFO] - Starting iteration 353. [2026-03-25 19:36:17,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:36:17,193][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:36:27,712][__main__][INFO] - Number of regex retries in iteration 353: 0 [2026-03-25 19:36:27,713][__main__][INFO] - agents played in iteration 353 are Alice, Bob [2026-03-25 19:36:28,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:36:28,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:36:28,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:36:28,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:36:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:36:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:36:30,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:36:30,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:36:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:36:32,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:36:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:36:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:36:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:36:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:36:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:36:36,217][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:36:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:36:37,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:36:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:36:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:36:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:36:40,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:36:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:36:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:36:42,139][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:36:42,797][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:36:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:36:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:36:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:36:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:36:46,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:36:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:36:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:36:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:36:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:36:49,379][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:36:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:36:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:36:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:36:52,011][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:36:52,668][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:36:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:36:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:36:54,642][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:36:55,300][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:36:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:36:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:36:57,276][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:36:57,935][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:36:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:36:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:36:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:37:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:37:01,486][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:37:02,144][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:37:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:37:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:37:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:37:04,784][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:37:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:37:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:37:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:37:07,422][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:37:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:37:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:37:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:37:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:37:10,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:37:11,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:37:12,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:37:13,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:37:13,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:37:13,218][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:37:14,428][__main__][INFO] - Iteration 354 took 57s (18.38% Gen, 79.50% Train). Generation: 10s, Training: 45s. Estimated remaining time: 10h 38m 7s. Estimated total time: 15h 53m 57s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 23s, 500 more iterations: 7h 56m 58s. [2026-03-25 19:37:14,430][__main__][INFO] - Starting iteration 354. [2026-03-25 19:37:14,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:37:14,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:37:19,578][__main__][INFO] - Number of regex retries in iteration 354: 0 [2026-03-25 19:37:19,579][__main__][INFO] - agents played in iteration 354 are Alice, Bob [2026-03-25 19:37:20,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:37:20,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:37:20,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:37:20,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:37:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:37:21,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:37:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:37:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:37:23,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:37:24,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:37:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:37:25,437][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:37:26,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:37:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:37:27,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:37:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:37:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:37:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:37:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:37:30,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:37:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:37:32,044][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:37:32,704][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:37:33,365][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:37:34,025][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:37:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:37:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:37:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:37:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:37:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:37:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:37:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:37:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:37:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:37:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:37:41,294][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:37:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:37:42,617][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:37:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:37:43,937][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:37:44,597][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:37:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:37:45,919][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:37:46,577][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:37:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:37:47,898][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:37:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:37:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:37:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:37:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:37:51,200][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:37:51,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:37:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:37:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:37:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:37:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:37:55,476][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:37:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:37:56,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:37:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:37:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:37:58,779][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:37:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:38:00,100][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:38:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:38:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:38:02,081][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:38:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:38:03,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:38:04,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:38:05,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:38:05,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:38:05,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:38:06,852][__main__][INFO] - Iteration 355 took 52s (9.81% Gen, 87.25% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 16m 57s. Estimated total time: 14h 33m 40s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 50s. [2026-03-25 19:38:06,855][__main__][INFO] - Starting iteration 355. [2026-03-25 19:38:06,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:38:06,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:38:12,027][__main__][INFO] - Number of regex retries in iteration 355: 0 [2026-03-25 19:38:12,028][__main__][INFO] - agents played in iteration 355 are Alice, Bob [2026-03-25 19:38:12,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:38:12,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:38:12,653][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:38:12,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:38:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:38:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:38:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:38:15,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:38:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:38:16,580][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:38:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:38:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:38:18,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:38:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:38:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:38:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:38:21,201][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:38:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:38:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:38:23,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:38:23,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:38:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:38:25,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:38:25,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:38:26,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:38:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:38:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:38:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:38:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:38:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:38:30,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:38:31,112][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:38:31,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:38:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:38:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:38:33,755][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:38:34,416][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:38:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:38:35,739][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:38:36,399][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:38:37,060][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:38:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:38:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:38:39,042][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:38:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:38:40,362][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:38:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:38:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:38:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:38:43,007][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:38:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:38:44,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:38:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:38:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:38:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:38:47,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:38:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:38:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:38:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:38:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:38:50,598][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:38:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:38:51,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:38:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:38:53,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:38:53,900][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:38:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:38:55,220][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:38:55,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:38:56,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:38:57,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:38:57,845][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:38:57,846][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:38:59,135][__main__][INFO] - Iteration 356 took 52s (9.89% Gen, 87.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 13m 42s. Estimated total time: 14h 31m 17s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 38s. [2026-03-25 19:38:59,137][__main__][INFO] - Starting iteration 356. [2026-03-25 19:38:59,141][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:38:59,142][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:39:04,240][__main__][INFO] - Number of regex retries in iteration 356: 0 [2026-03-25 19:39:04,241][__main__][INFO] - agents played in iteration 356 are Alice, Bob [2026-03-25 19:39:04,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:04,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:04,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:39:04,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:39:05,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:39:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:39:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:39:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:39:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:39:08,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:39:09,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:39:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:39:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:39:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:39:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:39:12,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:39:13,409][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:39:14,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:39:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:39:15,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:39:16,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:39:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:39:17,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:39:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:39:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:39:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:39:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:39:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:39:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:39:21,992][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:39:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:39:23,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:39:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:39:24,633][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:39:25,293][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:39:25,955][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:39:26,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:39:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:39:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:39:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:39:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:39:29,912][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:39:30,571][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:39:31,232][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:39:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:39:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:39:33,211][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:39:33,871][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:39:34,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:39:35,190][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:39:35,851][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:39:36,510][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:39:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:39:38,134][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:39:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:39:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:39:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:39:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:39:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:39:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:39:42,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:39:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:39:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:39:44,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:39:45,391][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:39:46,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:39:46,711][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:39:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:39:48,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:39:48,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:39:49,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:39:49,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:39:49,914][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:39:51,372][__main__][INFO] - Iteration 357 took 52s (9.76% Gen, 87.44% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 12m 5s. Estimated total time: 14h 30m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 16s. [2026-03-25 19:39:51,377][__main__][INFO] - Starting iteration 357. [2026-03-25 19:39:51,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:39:51,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:39:56,767][__main__][INFO] - Number of regex retries in iteration 357: 0 [2026-03-25 19:39:56,769][__main__][INFO] - agents played in iteration 357 are Alice, Bob [2026-03-25 19:39:57,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:57,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:57,370][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:39:57,371][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:39:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:39:58,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:39:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:39:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:40:00,654][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:40:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:40:01,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:40:02,627][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:40:03,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:40:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:40:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:40:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:40:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:40:06,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:40:07,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:40:07,892][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:40:08,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:40:09,209][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:40:09,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:40:10,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:40:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:40:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:40:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:40:13,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:40:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:40:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:40:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:40:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:40:16,445][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:40:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:40:17,761][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:40:18,419][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:40:19,077][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:40:19,734][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:40:20,391][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:40:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:40:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:40:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:40:23,021][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:40:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:40:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:40:24,993][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:40:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:40:26,308][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:40:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:40:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:40:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:40:28,939][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:40:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:40:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:40:31,160][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:40:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:40:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:40:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:40:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:40:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:40:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:40:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:40:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:40:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:40:37,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:40:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:40:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:40:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:40:40,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:40:41,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:40:42,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:40:42,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:40:42,229][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:40:43,695][__main__][INFO] - Iteration 358 took 52s (10.30% Gen, 86.89% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 12m 36s. Estimated total time: 14h 31m 56s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 58s. [2026-03-25 19:40:43,699][__main__][INFO] - Starting iteration 358. [2026-03-25 19:40:43,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:40:43,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:40:48,964][__main__][INFO] - Number of regex retries in iteration 358: 0 [2026-03-25 19:40:48,966][__main__][INFO] - agents played in iteration 358 are Alice, Bob [2026-03-25 19:40:49,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:40:49,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:40:49,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:40:49,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:40:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:40:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:40:51,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:40:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:40:52,921][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:40:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:40:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:40:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:40:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:40:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:40:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:40:57,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:40:58,196][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:40:58,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:40:59,516][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:41:00,175][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:41:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:41:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:41:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:41:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:41:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:41:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:41:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:41:05,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:41:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:41:06,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:41:07,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:41:08,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:41:08,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:41:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:41:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:41:10,724][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:41:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:41:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:41:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:41:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:41:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:41:14,680][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:41:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:41:15,999][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:41:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:41:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:41:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:41:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:41:19,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:41:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:41:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:41:21,266][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:41:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:41:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:41:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:41:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:41:24,816][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:41:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:41:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:41:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:41:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:41:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:41:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:41:29,427][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:41:30,085][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:41:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:41:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:41:32,061][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:41:32,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:41:33,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:41:34,538][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:41:34,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:41:34,543][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:41:38,747][__main__][INFO] - Iteration 359 took 55s (9.56% Gen, 82.80% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 57m 11s. Estimated total time: 15h 17m 25s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 42s. [2026-03-25 19:41:38,750][__main__][INFO] - Starting iteration 359. [2026-03-25 19:41:38,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:41:38,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:41:43,989][__main__][INFO] - Number of regex retries in iteration 359: 0 [2026-03-25 19:41:43,991][__main__][INFO] - agents played in iteration 359 are Alice, Bob [2026-03-25 19:41:44,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:44,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:44,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:41:44,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:41:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:41:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:41:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:41:47,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:41:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:41:48,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:41:49,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:41:49,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:41:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:41:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:41:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:41:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:41:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:41:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:41:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:41:55,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:41:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:41:56,512][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:41:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:41:57,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:41:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:41:59,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:41:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:42:00,458][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:42:01,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:42:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:42:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:42:03,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:42:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:42:04,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:42:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:42:05,719][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:42:06,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:42:07,034][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:42:07,692][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:42:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:42:09,008][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:42:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:42:10,323][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:42:10,981][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:42:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:42:12,296][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:42:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:42:13,611][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:42:14,270][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:42:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:42:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:42:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:42:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:42:17,856][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:42:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:42:19,172][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:42:19,830][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:42:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:42:21,146][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:42:21,804][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:42:22,462][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:42:23,120][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:42:23,777][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:42:24,435][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:42:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:42:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:42:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:42:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:42:27,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:42:28,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:42:29,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:42:29,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:42:29,703][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:42:31,077][__main__][INFO] - Iteration 360 took 52s (10.01% Gen, 87.36% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 10m 58s. Estimated total time: 14h 32m 5s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 2s. [2026-03-25 19:42:31,079][__main__][INFO] - Starting iteration 360. [2026-03-25 19:42:31,082][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:42:31,083][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:42:36,149][__main__][INFO] - Number of regex retries in iteration 360: 0 [2026-03-25 19:42:36,150][__main__][INFO] - agents played in iteration 360 are Alice, Bob [2026-03-25 19:42:36,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:42:36,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:42:36,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:42:36,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:42:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:42:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:42:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:42:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:42:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:42:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:42:41,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:42:42,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:42:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:42:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:42:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:42:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:42:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:42:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:42:46,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:42:47,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:42:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:42:48,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:42:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:42:49,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:42:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:42:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:42:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:42:52,586][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:42:53,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:42:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:42:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:42:55,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:42:55,878][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:42:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:42:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:42:57,855][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:42:58,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:42:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:42:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:43:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:43:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:43:01,804][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:43:02,462][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:43:03,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:43:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:43:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:43:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:43:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:43:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:43:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:43:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:43:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:43:09,281][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:43:09,941][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:43:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:43:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:43:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:43:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:43:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:43:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:43:14,548][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:43:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:43:15,864][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:43:16,522][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:43:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:43:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:43:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:43:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:43:19,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:43:20,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:43:21,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:43:21,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:43:21,668][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:43:22,935][__main__][INFO] - Iteration 361 took 51s (9.77% Gen, 87.78% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 2m 14s. Estimated total time: 14h 24m 13s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 6s. [2026-03-25 19:43:22,937][__main__][INFO] - Starting iteration 361. [2026-03-25 19:43:22,940][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:43:22,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:43:28,063][__main__][INFO] - Number of regex retries in iteration 361: 0 [2026-03-25 19:43:28,064][__main__][INFO] - agents played in iteration 361 are Alice, Bob [2026-03-25 19:43:28,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:43:28,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:43:28,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:43:28,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:43:29,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:43:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:43:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:43:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:43:32,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:43:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:43:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:43:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:43:34,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:43:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:43:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:43:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:43:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:43:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:43:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:43:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:43:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:43:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:43:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:43:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:43:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:43:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:43:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:43:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:43:45,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:43:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:43:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:43:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:43:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:43:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:43:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:43:49,939][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:43:50,598][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:43:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:43:51,915][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:43:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:43:53,231][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:43:53,890][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:43:54,548][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:43:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:43:55,864][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:43:56,524][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:43:57,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:43:57,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:43:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:43:59,158][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:43:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:44:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:44:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:44:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:44:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:44:03,411][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:44:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:44:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:44:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:44:06,042][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:44:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:44:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:44:08,018][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:44:08,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:44:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:44:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:44:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:44:11,307][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:44:11,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:44:12,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:44:13,928][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:44:13,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:44:13,931][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:44:15,243][__main__][INFO] - Iteration 362 took 52s (9.80% Gen, 87.69% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 8m 53s. Estimated total time: 14h 31m 44s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 52s. [2026-03-25 19:44:15,245][__main__][INFO] - Starting iteration 362. [2026-03-25 19:44:15,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:44:15,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:44:20,387][__main__][INFO] - Number of regex retries in iteration 362: 0 [2026-03-25 19:44:20,388][__main__][INFO] - agents played in iteration 362 are Alice, Bob [2026-03-25 19:44:21,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:44:21,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:44:21,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:44:21,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:44:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:44:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:44:23,052][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:44:23,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:44:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:44:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:44:25,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:44:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:44:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:44:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:44:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:44:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:44:29,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:44:30,294][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:44:30,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:44:31,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:44:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:44:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:44:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:44:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:44:34,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:44:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:44:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:44:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:44:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:44:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:44:38,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:44:39,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:44:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:44:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:44:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:44:42,146][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:44:42,806][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:44:43,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:44:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:44:44,780][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:44:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:44:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:44:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:44:47,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:44:48,071][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:44:48,729][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:44:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:44:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:44:50,702][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:44:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:44:52,019][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:44:52,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:44:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:44:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:44:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:44:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:44:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:44:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:44:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:44:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:44:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:44:59,506][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:45:00,165][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:45:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:45:01,483][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:45:02,141][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:45:02,799][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:45:03,458][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:45:04,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:45:04,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:45:05,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:45:05,953][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:45:05,954][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:45:07,322][__main__][INFO] - Iteration 363 took 52s (9.87% Gen, 87.50% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 4m 11s. Estimated total time: 14h 27m 54s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 57s. [2026-03-25 19:45:07,325][__main__][INFO] - Starting iteration 363. [2026-03-25 19:45:07,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:45:07,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:45:12,410][__main__][INFO] - Number of regex retries in iteration 363: 0 [2026-03-25 19:45:12,411][__main__][INFO] - agents played in iteration 363 are Alice, Bob [2026-03-25 19:45:13,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:45:13,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:45:13,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:45:13,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:45:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:45:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:45:15,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:45:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:45:16,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:45:17,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:45:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:45:18,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:45:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:45:19,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:45:20,385][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:45:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:45:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:45:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:45:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:45:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:45:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:45:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:45:25,646][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:45:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:45:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:45:27,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:45:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:45:28,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:45:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:45:30,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:45:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:45:31,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:45:32,227][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:45:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:45:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:45:34,201][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:45:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:45:35,519][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:45:36,178][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:45:36,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:45:37,494][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:45:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:45:38,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:45:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:45:40,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:45:40,784][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:45:41,441][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:45:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:45:42,757][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:45:43,415][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:45:44,073][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:45:44,731][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:45:45,639][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:45:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:45:46,956][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:45:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:45:48,275][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:45:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:45:49,592][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:45:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:45:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:45:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:45:52,224][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:45:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:45:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:45:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:45:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:45:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:45:56,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:45:57,022][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:45:58,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:45:58,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:45:58,154][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:46:00,127][__main__][INFO] - Iteration 364 took 52s (9.62% Gen, 86.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 15m 24s. Estimated total time: 14h 40m 0s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 0s. [2026-03-25 19:46:00,130][__main__][INFO] - Starting iteration 364. [2026-03-25 19:46:00,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:46:00,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:46:05,193][__main__][INFO] - Number of regex retries in iteration 364: 0 [2026-03-25 19:46:05,194][__main__][INFO] - agents played in iteration 364 are Alice, Bob [2026-03-25 19:46:05,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:05,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:05,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:46:05,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:46:06,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:46:07,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:46:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:46:08,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:46:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:46:09,790][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:46:10,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:46:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:46:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:46:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:46:13,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:46:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:46:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:46:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:46:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:46:16,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:46:17,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:46:17,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:46:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:46:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:46:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:46:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:46:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:46:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:46:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:46:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:46:23,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:46:24,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:46:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:46:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:46:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:46:26,942][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:46:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:46:28,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:46:28,920][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:46:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:46:30,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:46:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:46:31,558][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:46:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:46:32,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:46:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:46:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:46:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:46:35,519][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:46:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:46:36,839][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:46:37,498][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:46:38,419][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:46:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:46:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:46:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:46:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:46:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:46:42,376][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:46:43,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:46:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:46:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:46:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:46:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:46:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:46:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:46:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:46:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:46:48,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:46:49,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:46:50,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:46:50,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:46:50,910][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:46:52,211][__main__][INFO] - Iteration 365 took 52s (9.72% Gen, 87.78% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 2m 30s. Estimated total time: 14h 27m 58s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 59s. [2026-03-25 19:46:52,213][__main__][INFO] - Starting iteration 365. [2026-03-25 19:46:52,217][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:46:52,218][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:46:57,197][__main__][INFO] - Number of regex retries in iteration 365: 0 [2026-03-25 19:46:57,198][__main__][INFO] - agents played in iteration 365 are Alice, Bob [2026-03-25 19:46:57,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:57,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:57,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:46:57,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:46:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:46:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:46:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:47:00,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:47:01,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:47:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:47:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:47:03,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:47:03,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:47:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:47:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:47:05,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:47:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:47:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:47:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:47:08,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:47:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:47:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:47:10,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:47:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:47:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:47:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:47:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:47:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:47:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:47:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:47:15,649][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:47:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:47:16,970][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:47:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:47:18,294][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:47:18,954][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:47:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:47:20,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:47:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:47:21,595][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:47:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:47:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:47:23,577][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:47:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:47:24,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:47:25,559][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:47:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:47:26,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:47:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:47:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:47:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:47:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:47:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:47:31,169][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:47:31,830][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:47:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:47:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:47:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:47:34,471][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:47:35,131][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:47:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:47:36,451][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:47:37,111][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:47:37,772][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:47:38,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:47:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:47:39,752][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:47:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:47:41,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:47:41,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:47:42,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:47:42,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:47:42,908][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:47:44,459][__main__][INFO] - Iteration 366 took 52s (9.53% Gen, 87.49% Train). Generation: 4s, Training: 45s. Estimated remaining time: 9h 4m 23s. Estimated total time: 14h 30m 43s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 21s. [2026-03-25 19:47:44,462][__main__][INFO] - Starting iteration 366. [2026-03-25 19:47:44,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:47:44,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:47:49,595][__main__][INFO] - Number of regex retries in iteration 366: 0 [2026-03-25 19:47:49,597][__main__][INFO] - agents played in iteration 366 are Alice, Bob [2026-03-25 19:47:50,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:47:50,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:47:50,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:47:50,324][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:47:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:47:51,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:47:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:47:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:47:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:47:54,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:47:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:47:55,604][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:47:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:47:56,925][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:47:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:47:58,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:47:58,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:47:59,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:48:00,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:48:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:48:01,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:48:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:48:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:48:03,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:48:04,183][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:48:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:48:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:48:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:48:06,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:48:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:48:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:48:08,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:48:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:48:10,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:48:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:48:11,428][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:48:12,087][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:48:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:48:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:48:14,064][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:48:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:48:15,382][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:48:16,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:48:16,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:48:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:48:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:48:18,678][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:48:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:48:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:48:20,655][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:48:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:48:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:48:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:48:23,548][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:48:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:48:24,865][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:48:25,524][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:48:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:48:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:48:27,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:48:28,161][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:48:28,820][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:48:29,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:48:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:48:30,795][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:48:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:48:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:48:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:48:33,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:48:34,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:48:35,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:48:35,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:48:35,356][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:48:36,626][__main__][INFO] - Iteration 367 took 52s (9.84% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 2m 9s. Estimated total time: 14h 29m 21s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 40s. [2026-03-25 19:48:36,628][__main__][INFO] - Starting iteration 367. [2026-03-25 19:48:36,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:48:36,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:48:41,595][__main__][INFO] - Number of regex retries in iteration 367: 0 [2026-03-25 19:48:41,596][__main__][INFO] - agents played in iteration 367 are Alice, Bob [2026-03-25 19:48:42,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:42,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:42,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:48:42,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:48:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:48:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:48:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:48:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:48:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:48:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:48:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:48:47,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:48:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:48:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:48:49,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:48:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:48:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:48:51,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:48:52,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:48:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:48:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:48:54,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:48:54,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:48:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:48:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:48:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:48:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:48:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:48:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:48:59,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:48:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:49:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:49:01,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:49:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:49:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:49:03,251][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:49:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:49:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:49:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:49:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:49:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:49:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:49:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:49:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:49:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:49:09,830][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:49:10,488][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:49:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:49:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:49:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:49:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:49:13,776][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:49:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:49:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:49:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:49:16,674][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:49:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:49:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:49:18,648][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:49:19,306][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:49:19,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:49:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:49:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:49:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:49:22,595][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:49:23,252][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:49:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:49:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:49:25,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:49:25,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:49:27,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:49:27,049][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:49:27,051][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:49:28,464][__main__][INFO] - Iteration 368 took 51s (9.58% Gen, 87.69% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 55m 49s. Estimated total time: 14h 23m 53s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 56s. [2026-03-25 19:49:28,466][__main__][INFO] - Starting iteration 368. [2026-03-25 19:49:28,470][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:49:28,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:49:33,495][__main__][INFO] - Number of regex retries in iteration 368: 0 [2026-03-25 19:49:33,496][__main__][INFO] - agents played in iteration 368 are Alice, Bob [2026-03-25 19:49:34,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:49:34,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:49:34,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:49:34,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:49:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:49:35,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:49:36,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:49:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:49:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:49:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:49:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:49:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:49:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:49:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:49:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:49:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:49:42,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:49:43,335][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:49:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:49:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:49:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:49:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:49:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:49:47,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:49:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:49:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:49:49,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:49:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:49:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:49:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:49:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:49:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:49:53,202][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:49:53,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:49:54,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:49:55,175][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:49:55,834][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:49:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:49:57,152][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:49:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:49:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:49:59,124][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:49:59,782][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:50:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:50:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:50:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:50:02,412][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:50:03,069][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:50:03,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:50:04,384][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:50:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:50:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:50:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:50:07,317][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:50:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:50:08,632][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:50:09,290][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:50:09,948][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:50:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:50:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:50:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:50:12,578][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:50:13,235][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:50:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:50:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:50:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:50:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:50:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:50:17,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:50:18,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:50:19,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:50:19,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:50:19,412][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:50:20,757][__main__][INFO] - Iteration 369 took 52s (9.61% Gen, 87.81% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 2m 32s. Estimated total time: 14h 31m 28s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 44s. [2026-03-25 19:50:20,760][__main__][INFO] - Starting iteration 369. [2026-03-25 19:50:20,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:50:20,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:50:26,070][__main__][INFO] - Number of regex retries in iteration 369: 0 [2026-03-25 19:50:26,071][__main__][INFO] - agents played in iteration 369 are Alice, Bob [2026-03-25 19:50:26,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:50:26,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:50:26,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:50:26,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:50:27,438][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:50:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:50:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:50:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:50:30,027][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:50:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:50:31,346][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:50:32,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:50:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:50:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:50:33,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:50:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:50:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:50:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:50:36,623][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:50:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:50:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:50:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:50:39,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:50:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:50:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:50:41,243][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:50:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:50:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:50:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:50:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:50:44,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:50:45,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:50:45,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:50:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:50:47,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:50:47,839][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:50:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:50:49,161][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:50:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:50:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:50:51,137][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:50:51,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:50:52,456][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:50:53,115][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:50:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:50:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:50:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:50:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:50:56,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:50:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:50:57,733][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:50:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:50:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:51:00,046][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:51:00,706][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:51:01,365][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:51:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:51:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:51:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:51:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:51:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:51:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:51:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:51:06,643][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:51:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:51:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:51:08,620][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:51:09,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:51:09,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:51:10,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:51:11,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:51:11,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:51:11,829][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:51:13,342][__main__][INFO] - Iteration 370 took 52s (10.09% Gen, 87.02% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 6m 30s. Estimated total time: 14h 36m 19s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 9s. [2026-03-25 19:51:13,345][__main__][INFO] - Starting iteration 370. [2026-03-25 19:51:13,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:51:13,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:51:18,453][__main__][INFO] - Number of regex retries in iteration 370: 0 [2026-03-25 19:51:18,455][__main__][INFO] - agents played in iteration 370 are Alice, Bob [2026-03-25 19:51:19,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:51:19,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:51:19,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:51:19,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:51:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:51:20,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:51:21,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:51:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:51:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:51:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:51:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:51:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:51:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:51:25,677][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:51:26,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:51:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:51:27,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:51:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:51:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:51:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:51:30,293][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:51:30,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:51:31,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:51:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:51:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:51:33,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:51:34,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:51:34,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:51:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:51:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:51:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:51:37,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:51:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:51:38,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:51:39,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:51:40,183][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:51:40,843][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:51:41,502][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:51:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:51:42,821][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:51:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:51:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:51:44,799][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:51:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:51:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:51:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:51:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:51:48,096][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:51:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:51:49,416][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:51:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:51:50,738][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:51:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:51:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:51:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:51:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:51:54,336][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:51:54,995][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:51:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:51:56,316][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:51:56,975][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:51:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:51:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:51:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:51:59,614][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:52:00,274][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:52:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:52:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:52:02,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:52:03,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:52:04,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:52:04,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:52:04,229][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:52:05,465][__main__][INFO] - Iteration 371 took 52s (9.80% Gen, 87.83% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 57m 57s. Estimated total time: 14h 28m 38s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 19s. [2026-03-25 19:52:05,468][__main__][INFO] - Starting iteration 371. [2026-03-25 19:52:05,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:52:05,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:52:16,258][__main__][INFO] - Number of regex retries in iteration 371: 0 [2026-03-25 19:52:16,259][__main__][INFO] - agents played in iteration 371 are Alice, Bob [2026-03-25 19:52:16,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:52:16,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:52:16,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:52:16,860][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:52:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:52:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:52:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:52:19,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:52:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:52:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:52:21,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:52:22,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:52:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:52:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:52:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:52:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:52:25,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:52:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:52:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:52:27,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:52:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:52:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:52:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:52:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:52:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:52:31,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:52:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:52:32,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:52:33,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:52:33,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:52:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:52:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:52:35,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:52:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:52:37,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:52:37,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:52:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:52:39,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:52:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:52:40,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:52:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:52:41,876][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:52:42,536][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:52:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:52:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:52:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:52:45,185][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:52:45,845][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:52:46,505][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:52:47,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:52:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:52:48,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:52:49,482][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:52:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:52:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:52:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:52:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:52:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:52:53,440][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:52:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:52:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:52:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:52:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:52:56,740][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:52:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:52:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:52:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:52:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:53:00,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:53:00,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:53:02,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:53:02,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:53:02,016][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:53:03,299][__main__][INFO] - Iteration 372 took 57s (18.65% Gen, 79.13% Train). Generation: 10s, Training: 45s. Estimated remaining time: 10h 32m 9s. Estimated total time: 16h 3m 48s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 22s, 500 more iterations: 8h 1m 54s. [2026-03-25 19:53:03,302][__main__][INFO] - Starting iteration 372. [2026-03-25 19:53:03,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:53:03,308][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:53:08,409][__main__][INFO] - Number of regex retries in iteration 372: 0 [2026-03-25 19:53:08,410][__main__][INFO] - agents played in iteration 372 are Alice, Bob [2026-03-25 19:53:09,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:53:09,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:53:09,087][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:53:09,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:53:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:53:10,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:53:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:53:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:53:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:53:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:53:13,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:53:14,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:53:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:53:15,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:53:16,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:53:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:53:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:53:18,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:53:18,930][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:53:19,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:53:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:53:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:53:21,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:53:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:53:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:53:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:53:24,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:53:24,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:53:25,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:53:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:53:26,854][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:53:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:53:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:53:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:53:29,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:53:30,153][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:53:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:53:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:53:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:53:32,792][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:53:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:53:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:53:34,771][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:53:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:53:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:53:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:53:37,408][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:53:38,069][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:53:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:53:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:53:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:53:40,706][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:53:41,698][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:53:42,356][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:53:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:53:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:53:44,336][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:53:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:53:45,656][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:53:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:53:46,975][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:53:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:53:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:53:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:53:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:53:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:53:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:53:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:53:52,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:53:53,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:53:54,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:53:54,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:53:54,252][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:53:55,598][__main__][INFO] - Iteration 373 took 52s (9.76% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 59m 1s. Estimated total time: 14h 31m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 46s. [2026-03-25 19:53:55,601][__main__][INFO] - Starting iteration 373. [2026-03-25 19:53:55,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:53:55,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:54:05,665][__main__][INFO] - Number of regex retries in iteration 373: 0 [2026-03-25 19:54:05,666][__main__][INFO] - agents played in iteration 373 are Alice, Bob [2026-03-25 19:54:06,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:54:06,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:54:06,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:54:06,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:54:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:54:07,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:54:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:54:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:54:09,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:54:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:54:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:54:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:54:12,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:54:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:54:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:54:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:54:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:54:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:54:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:54:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:54:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:54:18,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:54:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:54:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:54:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:54:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:54:21,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:54:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:54:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:54:23,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:54:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:54:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:54:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:54:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:54:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:54:27,375][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:54:28,035][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:54:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:54:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:54:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:54:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:54:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:54:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:54:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:54:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:54:33,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:54:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:54:35,284][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:54:35,943][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:54:36,603][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:54:37,261][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:54:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:54:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:54:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:54:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:54:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:54:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:54:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:54:42,792][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:54:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:54:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:54:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:54:45,430][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:54:46,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:54:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:54:47,408][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:54:48,068][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:54:48,728][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:54:49,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:54:50,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:54:51,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:54:51,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:54:51,288][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:54:52,656][__main__][INFO] - Iteration 374 took 57s (17.63% Gen, 79.96% Train). Generation: 10s, Training: 45s. Estimated remaining time: 10h 17m 23s. Estimated total time: 15h 50m 52s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 5s, 500 more iterations: 7h 55m 26s. [2026-03-25 19:54:52,658][__main__][INFO] - Starting iteration 374. [2026-03-25 19:54:52,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:54:52,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:55:01,228][__main__][INFO] - Number of regex retries in iteration 374: 0 [2026-03-25 19:55:01,231][__main__][INFO] - agents played in iteration 374 are Alice, Bob [2026-03-25 19:55:01,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:01,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:01,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:55:01,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:55:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:55:03,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:55:03,891][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:55:04,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:55:05,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:55:05,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:55:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:55:07,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:55:07,844][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:55:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:55:09,165][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:55:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:55:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:55:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:55:11,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:55:12,463][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:55:13,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:55:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:55:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:55:15,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:55:15,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:55:16,421][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:55:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:55:17,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:55:18,400][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:55:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:55:19,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:55:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:55:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:55:21,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:55:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:55:23,017][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:55:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:55:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:55:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:55:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:55:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:55:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:55:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:55:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:55:28,964][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:55:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:55:30,283][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:55:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:55:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:55:32,264][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:55:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:55:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:55:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:55:35,250][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:55:35,909][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:55:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:55:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:55:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:55:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:55:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:55:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:55:40,527][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:55:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:55:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:55:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:55:43,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:55:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:55:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:55:45,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:55:46,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:55:49,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:55:50,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:55:50,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:55:51,381][__main__][INFO] - Iteration 375 took 58s (14.59% Gen, 83.06% Train). Generation: 8s, Training: 48s. Estimated remaining time: 10h 44m 13s. Estimated total time: 16h 18m 40s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 52s, 500 more iterations: 8h 9m 20s. [2026-03-25 19:55:51,383][__main__][INFO] - Starting iteration 375. [2026-03-25 19:55:51,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:55:51,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:55:56,369][__main__][INFO] - Number of regex retries in iteration 375: 0 [2026-03-25 19:55:56,370][__main__][INFO] - agents played in iteration 375 are Alice, Bob [2026-03-25 19:55:56,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:56,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:56,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:55:56,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:55:57,677][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:55:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:55:58,951][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:55:59,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:56:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:56:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:56:01,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:56:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:56:02,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:56:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:56:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:56:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:56:05,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:56:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:56:06,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:56:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:56:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:56:08,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:56:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:56:10,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:56:10,817][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:56:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:56:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:56:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:56:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:56:14,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:56:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:56:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:56:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:56:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:56:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:56:18,060][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:56:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:56:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:56:20,034][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:56:20,694][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:56:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:56:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:56:22,668][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:56:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:56:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:56:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:56:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:56:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:56:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:56:27,272][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:56:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:56:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:56:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:56:30,179][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:56:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:56:31,494][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:56:32,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:56:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:56:33,468][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:56:34,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:56:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:56:35,443][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:56:36,101][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:56:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:56:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:56:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:56:38,735][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:56:39,394][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:56:40,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:56:40,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:56:42,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:56:42,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:56:42,084][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:56:43,411][__main__][INFO] - Iteration 376 took 52s (9.58% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 51m 46s. Estimated total time: 14h 27m 5s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 32s. [2026-03-25 19:56:43,414][__main__][INFO] - Starting iteration 376. [2026-03-25 19:56:43,418][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:56:43,418][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:56:48,435][__main__][INFO] - Number of regex retries in iteration 376: 0 [2026-03-25 19:56:48,437][__main__][INFO] - agents played in iteration 376 are Alice, Bob [2026-03-25 19:56:49,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:56:49,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:56:49,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:56:49,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:56:49,781][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:56:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:56:51,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:56:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:56:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:56:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:56:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:56:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:56:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:56:55,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:56:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:56:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:56:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:56:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:56:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:56:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:57:00,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:57:00,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:57:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:57:02,251][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:57:02,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:57:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:57:04,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:57:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:57:05,543][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:57:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:57:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:57:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:57:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:57:08,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:57:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:57:10,156][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:57:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:57:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:57:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:57:12,791][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:57:13,449][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:57:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:57:14,769][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:57:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:57:16,086][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:57:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:57:17,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:57:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:57:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:57:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:57:20,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:57:20,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:57:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:57:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:57:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:57:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:57:24,288][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:57:24,947][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:57:25,605][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:57:26,263][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:57:26,922][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:57:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:57:28,239][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:57:28,898][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:57:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:57:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:57:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:57:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:57:32,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:57:32,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:57:34,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:57:34,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:57:34,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:57:35,535][__main__][INFO] - Iteration 377 took 52s (9.63% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 52m 27s. Estimated total time: 14h 28m 38s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 19s. [2026-03-25 19:57:35,538][__main__][INFO] - Starting iteration 377. [2026-03-25 19:57:35,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:57:35,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:57:40,401][__main__][INFO] - Number of regex retries in iteration 377: 0 [2026-03-25 19:57:40,403][__main__][INFO] - agents played in iteration 377 are Alice, Bob [2026-03-25 19:57:41,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:57:41,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:57:41,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:57:41,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:57:41,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:57:42,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:57:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:57:43,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:57:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:57:45,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:57:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:57:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:57:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:57:47,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:57:48,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:57:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:57:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:57:50,284][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:57:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:57:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:57:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:57:52,915][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:57:53,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:57:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:57:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:57:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:57:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:57:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:57:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:57:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:57:58,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:57:59,496][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:58:00,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:58:00,813][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:58:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:58:02,131][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:58:02,789][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:58:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:58:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:58:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:58:05,423][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:58:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:58:06,739][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:58:07,396][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:58:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:58:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:58:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:58:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:58:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:58:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:58:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:58:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:58:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:58:14,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:58:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:58:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:58:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:58:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:58:17,497][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:58:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:58:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:58:19,472][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:58:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:58:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:58:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:58:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:58:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:58:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:58:24,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:58:24,786][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:58:25,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:58:25,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:58:25,930][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:58:27,202][__main__][INFO] - Iteration 378 took 51s (9.41% Gen, 88.12% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 43m 59s. Estimated total time: 14h 21m 2s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 31s. [2026-03-25 19:58:27,206][__main__][INFO] - Starting iteration 378. [2026-03-25 19:58:27,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:58:27,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:58:34,937][__main__][INFO] - Number of regex retries in iteration 378: 0 [2026-03-25 19:58:34,939][__main__][INFO] - agents played in iteration 378 are Alice, Bob [2026-03-25 19:58:35,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:58:35,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:58:35,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:58:35,628][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:58:36,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:58:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:58:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:58:38,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:58:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:58:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:58:40,201][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:58:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:58:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:58:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:58:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:58:43,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:58:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:58:44,811][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:58:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:58:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:58:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:58:47,440][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:58:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:58:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:58:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:58:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:58:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:58:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:58:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:58:52,700][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:58:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:58:54,015][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:58:54,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:58:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:58:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:58:56,646][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:58:57,304][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:58:57,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:58:58,619][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:58:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:58:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:59:00,592][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:59:01,249][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:59:01,907][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:59:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:59:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:59:03,880][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:59:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:59:05,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:59:05,853][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:59:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:59:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:59:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:59:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:59:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:59:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:59:10,719][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:59:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:59:12,033][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:59:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:59:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:59:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:59:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:59:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:59:15,980][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:59:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:59:17,296][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:59:17,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:59:18,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:59:19,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:59:20,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:59:20,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:59:20,653][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:59:21,910][__main__][INFO] - Iteration 379 took 54s (14.13% Gen, 83.57% Train). Generation: 7s, Training: 45s. Estimated remaining time: 9h 33m 44s. Estimated total time: 15h 11m 41s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 50s. [2026-03-25 19:59:21,912][__main__][INFO] - Starting iteration 379. [2026-03-25 19:59:21,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:59:21,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:59:35,842][__main__][INFO] - Number of regex retries in iteration 379: 0 [2026-03-25 19:59:35,843][__main__][INFO] - agents played in iteration 379 are Alice, Bob [2026-03-25 19:59:36,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:59:36,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:59:36,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:59:36,430][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:59:37,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:59:37,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:59:38,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:59:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:59:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:59:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:59:40,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:59:41,649][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:59:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:59:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:59:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:59:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:59:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:59:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:59:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:59:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:59:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:59:48,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:59:48,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:59:49,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:59:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:59:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:59:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:59:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:59:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:59:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:59:54,152][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:59:54,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:59:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:59:56,127][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:59:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:59:57,443][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:59:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:59:58,759][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:59:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:00:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:00:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:00:01,391][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:00:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:00:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:00:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:00:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:00:04,682][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:00:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:00:05,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:00:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:00:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:00:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:00:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:00:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:00:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:00:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:00:11,497][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:00:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:00:12,815][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:00:13,473][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:00:14,133][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:00:14,791][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:00:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:00:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:00:16,766][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:00:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:00:18,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:00:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:00:19,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:00:20,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:00:21,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:00:21,254][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:00:21,255][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:00:22,679][__main__][INFO] - Iteration 380 took 1m 0s (22.92% Gen, 74.73% Train). Generation: 13s, Training: 45s. Estimated remaining time: 11h 13m 46s. Estimated total time: 16h 52m 45s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 16s, 500 more iterations: 8h 26m 22s. [2026-03-25 20:00:22,682][__main__][INFO] - Starting iteration 380. [2026-03-25 20:00:22,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:00:22,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:00:38,973][__main__][INFO] - Number of regex retries in iteration 380: 0 [2026-03-25 20:00:38,975][__main__][INFO] - agents played in iteration 380 are Alice, Bob [2026-03-25 20:00:39,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:00:39,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:00:39,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:00:39,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:00:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:00:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:00:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:00:42,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:00:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:00:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:00:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:00:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:00:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:00:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:00:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:00:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:00:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:00:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:00:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:00:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:00:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:00:51,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:00:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:00:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:00:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:00:54,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:00:54,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:00:55,393][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:00:56,051][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:00:56,709][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:00:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:00:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:00:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:00:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:00:59,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:01:00,653][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:01:01,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:01:01,970][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:01:02,627][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:01:03,285][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:01:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:01:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:01:05,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:01:05,915][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:01:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:01:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:01:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:01:08,545][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:01:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:01:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:01:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:01:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:01:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:01:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:01:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:01:14,077][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:01:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:01:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:01:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:01:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:01:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:01:18,023][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:01:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:01:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:01:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:01:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:01:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:01:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:01:22,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:01:23,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:01:24,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:01:24,501][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:01:24,502][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:01:25,874][__main__][INFO] - Iteration 381 took 1m 3s (25.78% Gen, 72.05% Train). Generation: 16s, Training: 45s. Estimated remaining time: 11h 53m 9s. Estimated total time: 17h 33m 11s. Time estimates for 10 more iterations: 10m 31s, 100 more iterations: 1h 45m 19s, 500 more iterations: 8h 46m 35s. [2026-03-25 20:01:25,877][__main__][INFO] - Starting iteration 381. [2026-03-25 20:01:25,881][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:01:25,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:01:30,892][__main__][INFO] - Number of regex retries in iteration 381: 0 [2026-03-25 20:01:30,893][__main__][INFO] - agents played in iteration 381 are Alice, Bob [2026-03-25 20:01:31,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:01:31,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:01:31,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:01:31,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:01:32,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:01:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:01:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:01:34,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:01:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:01:35,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:01:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:01:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:01:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:01:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:01:38,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:01:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:01:40,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:01:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:01:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:01:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:01:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:01:43,346][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:01:44,005][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:01:44,663][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:01:45,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:01:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:01:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:01:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:01:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:01:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:01:49,270][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:01:49,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:01:50,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:01:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:01:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:01:52,559][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:01:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:01:53,876][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:01:54,533][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:01:55,191][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:01:55,850][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:01:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:01:57,166][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:01:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:01:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:01:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:01:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:02:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:02:01,111][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:02:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:02:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:02:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:02:03,974][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:02:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:02:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:02:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:02:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:02:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:02:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:02:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:02:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:02:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:02:10,555][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:02:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:02:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:02:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:02:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:02:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:02:14,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:02:15,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:02:16,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:02:16,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:02:16,753][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:02:24,008][__main__][INFO] - Iteration 382 took 58s (8.62% Gen, 78.89% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 27m 49s. Estimated total time: 16h 8m 49s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 52s, 500 more iterations: 8h 4m 24s. [2026-03-25 20:02:24,012][__main__][INFO] - Starting iteration 382. [2026-03-25 20:02:24,016][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:02:24,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:02:34,454][__main__][INFO] - Number of regex retries in iteration 382: 0 [2026-03-25 20:02:34,456][__main__][INFO] - agents played in iteration 382 are Alice, Bob [2026-03-25 20:02:35,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:02:35,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:02:35,094][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:02:35,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:02:35,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:02:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:02:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:02:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:02:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:02:39,058][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:02:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:02:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:02:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:02:41,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:02:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:02:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:02:43,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:02:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:02:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:02:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:02:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:02:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:02:47,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:02:48,284][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:02:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:02:49,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:02:50,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:02:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:02:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:02:52,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:02:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:02:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:02:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:02:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:02:55,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:02:56,192][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:02:56,851][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:02:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:02:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:02:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:02:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:03:00,146][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:03:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:03:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:03:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:03:02,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:03:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:03:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:03:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:03:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:03:06,065][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:03:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:03:07,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:03:08,319][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:03:08,977][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:03:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:03:10,294][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:03:10,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:03:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:03:12,267][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:03:12,924][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:03:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:03:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:03:14,902][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:03:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:03:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:03:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:03:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:03:18,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:03:20,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:03:21,494][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:03:21,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:03:21,498][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:03:22,913][__main__][INFO] - Iteration 383 took 58s (17.72% Gen, 79.87% Train). Generation: 10s, Training: 47s. Estimated remaining time: 10h 39m 40s. Estimated total time: 16h 21m 39s. Time estimates for 10 more iterations: 9m 48s, 100 more iterations: 1h 38m 9s, 500 more iterations: 8h 10m 49s. [2026-03-25 20:03:22,916][__main__][INFO] - Starting iteration 383. [2026-03-25 20:03:22,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:03:22,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:03:27,806][__main__][INFO] - Number of regex retries in iteration 383: 0 [2026-03-25 20:03:27,807][__main__][INFO] - agents played in iteration 383 are Alice, Bob [2026-03-25 20:03:28,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:03:28,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:03:28,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:03:28,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:03:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:03:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:03:30,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:03:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:03:31,749][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:03:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:03:33,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:03:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:03:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:03:35,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:03:35,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:03:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:03:37,016][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:03:37,674][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:03:38,333][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:03:38,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:03:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:03:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:03:40,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:03:41,626][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:03:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:03:42,944][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:03:43,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:03:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:03:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:03:45,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:03:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:03:46,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:03:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:03:48,220][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:03:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:03:49,537][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:03:50,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:03:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:03:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:03:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:03:52,837][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:03:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:03:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:03:54,818][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:03:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:03:56,138][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:03:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:03:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:03:58,118][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:03:58,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:03:59,436][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:04:00,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:04:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:04:01,653][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:04:02,311][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:04:02,970][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:04:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:04:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:04:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:04:05,607][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:04:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:04:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:04:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:04:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:04:08,902][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:04:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:04:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:04:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:04:11,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:04:12,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:04:13,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:04:13,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:04:13,378][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:04:14,703][__main__][INFO] - Iteration 384 took 51s (9.44% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 40m 14s. Estimated total time: 14h 23m 5s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 32s. [2026-03-25 20:04:14,706][__main__][INFO] - Starting iteration 384. [2026-03-25 20:04:14,710][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:04:14,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:04:23,407][__main__][INFO] - Number of regex retries in iteration 384: 0 [2026-03-25 20:04:23,408][__main__][INFO] - agents played in iteration 384 are Alice, Bob [2026-03-25 20:04:24,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:04:24,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:04:24,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:04:24,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:04:24,783][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:04:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:04:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:04:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:04:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:04:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:04:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:04:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:04:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:04:30,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:04:31,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:04:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:04:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:04:33,293][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:04:33,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:04:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:04:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:04:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:04:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:04:37,243][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:04:37,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:04:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:04:39,217][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:04:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:04:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:04:41,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:04:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:04:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:04:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:04:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:04:44,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:04:45,142][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:04:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:04:46,459][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:04:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:04:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:04:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:04:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:04:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:04:50,409][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:04:51,067][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:04:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:04:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:04:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:04:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:04:54,358][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:04:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:04:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:04:56,602][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:04:57,262][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:04:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:04:58,578][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:04:59,237][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:04:59,895][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:05:00,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:05:01,213][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:05:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:05:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:05:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:05:03,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:05:04,508][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:05:05,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:05:05,826][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:05:06,484][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:05:07,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:05:08,010][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:05:09,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:05:09,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:05:09,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:05:10,452][__main__][INFO] - Iteration 385 took 55s (15.60% Gen, 82.03% Train). Generation: 8s, Training: 45s. Estimated remaining time: 9h 45m 17s. Estimated total time: 15h 29m 3s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 31s. [2026-03-25 20:05:10,455][__main__][INFO] - Starting iteration 385. [2026-03-25 20:05:10,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:05:10,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:05:24,147][__main__][INFO] - Number of regex retries in iteration 385: 0 [2026-03-25 20:05:24,149][__main__][INFO] - agents played in iteration 385 are Alice, Bob [2026-03-25 20:05:24,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:05:24,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:05:24,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:05:24,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:05:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:05:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:05:26,740][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:05:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:05:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:05:28,712][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:05:29,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:05:30,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:05:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:05:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:05:32,003][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:05:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:05:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:05:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:05:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:05:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:05:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:05:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:05:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:05:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:05:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:05:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:05:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:05:40,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:05:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:05:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:05:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:05:43,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:05:43,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:05:44,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:05:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:05:45,841][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:05:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:05:47,159][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:05:47,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:05:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:05:49,135][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:05:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:05:50,453][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:05:51,111][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:05:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:05:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:05:53,089][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:05:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:05:54,409][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:05:55,067][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:05:55,726][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:05:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:05:57,279][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:05:57,937][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:05:58,595][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:05:59,253][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:05:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:06:00,570][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:06:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:06:01,894][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:06:02,553][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:06:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:06:03,872][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:06:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:06:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:06:05,848][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:06:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:06:07,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:06:07,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:06:08,669][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:06:16,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:06:16,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:06:16,019][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:06:17,250][__main__][INFO] - Iteration 386 took 1m 6s (20.49% Gen, 77.66% Train). Generation: 13s, Training: 51s. Estimated remaining time: 12h 48m 20s. Estimated total time: 18h 33m 13s. Time estimates for 10 more iterations: 11m 7s, 100 more iterations: 1h 51m 19s, 500 more iterations: 9h 16m 36s. [2026-03-25 20:06:17,253][__main__][INFO] - Starting iteration 386. [2026-03-25 20:06:17,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:06:17,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:06:22,400][__main__][INFO] - Number of regex retries in iteration 386: 0 [2026-03-25 20:06:22,402][__main__][INFO] - agents played in iteration 386 are Alice, Bob [2026-03-25 20:06:22,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:06:22,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:06:22,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:06:22,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:06:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:06:24,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:06:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:06:25,614][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:06:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:06:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:06:27,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:06:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:06:28,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:06:29,562][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:06:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:06:30,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:06:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:06:32,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:06:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:06:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:06:34,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:06:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:06:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:06:36,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:06:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:06:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:06:38,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:06:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:06:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:06:40,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:06:40,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:06:41,404][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:06:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:06:42,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:06:43,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:06:44,040][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:06:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:06:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:06:46,015][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:06:46,673][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:06:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:06:47,990][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:06:48,649][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:06:49,307][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:06:49,965][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:06:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:06:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:06:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:06:52,600][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:06:53,258][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:06:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:06:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:06:55,508][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:06:56,169][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:06:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:06:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:06:58,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:06:58,803][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:06:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:07:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:07:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:07:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:07:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:07:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:07:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:07:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:07:04,730][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:07:05,389][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:07:06,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:07:06,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:07:07,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:07:07,872][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:07:07,874][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:07:09,225][__main__][INFO] - Iteration 387 took 51s (9.90% Gen, 87.50% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 40m 25s. Estimated total time: 14h 26m 10s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 5s. [2026-03-25 20:07:09,228][__main__][INFO] - Starting iteration 387. [2026-03-25 20:07:09,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 20:07:09,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:07:10,691][mllm.models.large_language_model_local][WARNING] - Response . 
did not match regex: (|), retry 1/1 [2026-03-25 20:07:14,805][__main__][INFO] - Number of regex retries in iteration 387: 1 [2026-03-25 20:07:14,806][__main__][INFO] - agents played in iteration 387 are Alice, Bob [2026-03-25 20:07:15,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:07:15,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:07:15,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:07:15,500][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:07:16,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:07:16,769][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:07:17,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:07:18,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:07:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:07:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:07:20,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:07:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:07:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:07:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:07:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 20:07:23,342][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 20:07:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:07:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:07:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:07:25,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:07:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:07:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:07:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:07:28,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:07:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:07:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:07:30,576][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:07:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:07:31,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:07:32,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:07:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:07:33,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:07:34,523][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:07:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:07:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:07:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
20:07:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:07:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:07:38,468][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:07:39,126][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:07:39,783][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:07:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:07:41,099][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:07:41,757][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:07:42,415][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:07:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:07:43,730][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:07:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:07:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:07:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:07:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:07:47,021][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:07:47,904][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:07:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:07:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:07:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 20:07:50,535][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 20:07:51,193][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:07:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:07:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:07:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:07:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:07:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:07:55,140][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:07:55,799][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:07:56,458][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:07:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:07:57,773][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:07:58,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:07:59,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:08:00,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:08:00,425][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:08:00,427][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:08:01,772][__main__][INFO] - Iteration 388 took 52s (10.61% Gen, 86.83% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 49m 4s. Estimated total time: 14h 35m 42s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 51s. [2026-03-25 20:08:01,775][__main__][INFO] - Starting iteration 388. [2026-03-25 20:08:01,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:08:01,779][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:08:06,832][__main__][INFO] - Number of regex retries in iteration 388: 0 [2026-03-25 20:08:06,834][__main__][INFO] - agents played in iteration 388 are Alice, Bob [2026-03-25 20:08:07,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:07,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:07,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:08:07,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:08:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:08:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:08:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:08:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:08:10,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:08:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:08:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:08:12,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:08:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:08:13,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:08:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:08:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:08:15,903][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:08:16,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:08:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:08:17,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:08:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:08:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:08:19,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:08:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:08:21,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:08:21,828][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:08:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:08:23,144][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:08:23,803][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:08:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:08:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:08:25,778][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:08:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:08:27,096][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:08:27,755][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:08:28,413][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:08:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:08:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:08:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:08:31,049][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:08:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:08:32,367][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:08:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:08:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:08:34,345][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:08:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:08:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:08:36,322][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:08:36,981][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:08:37,641][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:08:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:08:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:08:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:08:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:08:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:08:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:08:42,611][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:08:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:08:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:08:44,591][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:08:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:08:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:08:46,571][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:08:47,231][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:08:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:08:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:08:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:08:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:08:50,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:08:51,346][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:08:52,472][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:08:52,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:08:52,477][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:08:54,035][__main__][INFO] - Iteration 389 took 52s (9.67% Gen, 87.34% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 43m 28s. Estimated total time: 14h 30m 58s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 29s. [2026-03-25 20:08:54,038][__main__][INFO] - Starting iteration 389. [2026-03-25 20:08:54,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:08:54,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:08:59,184][__main__][INFO] - Number of regex retries in iteration 389: 0 [2026-03-25 20:08:59,185][__main__][INFO] - agents played in iteration 389 are Alice, Bob [2026-03-25 20:08:59,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:59,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:59,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:08:59,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:09:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:09:01,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:09:01,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:09:02,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:09:03,110][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:09:03,767][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:09:04,426][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:09:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:09:05,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:09:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:09:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:09:07,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:09:08,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:09:09,030][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:09:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:09:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:09:11,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:09:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:09:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:09:12,979][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:09:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:09:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:09:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:09:15,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:09:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:09:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:09:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:09:18,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:09:18,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:09:19,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:09:20,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:09:20,877][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:09:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:09:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:09:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:09:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:09:24,174][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:09:24,833][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:09:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:09:26,150][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:09:26,811][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:09:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:09:28,129][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:09:28,787][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:09:29,445][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:09:30,104][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:09:30,760][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:09:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:09:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:09:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:09:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:09:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:09:35,054][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:09:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:09:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:09:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:09:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:09:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:09:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:09:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:09:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:09:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:09:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:09:42,299][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:09:42,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:09:43,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:09:44,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:09:44,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:09:44,815][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:09:46,152][__main__][INFO] - Iteration 390 took 52s (9.87% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 40m 10s. Estimated total time: 14h 28m 32s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 16s. [2026-03-25 20:09:46,155][__main__][INFO] - Starting iteration 390. [2026-03-25 20:09:46,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:09:46,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:09:51,434][__main__][INFO] - Number of regex retries in iteration 390: 0 [2026-03-25 20:09:51,435][__main__][INFO] - agents played in iteration 390 are Alice, Bob [2026-03-25 20:09:52,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:09:52,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:09:52,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:09:52,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:09:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:09:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:09:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:09:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:09:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:09:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:09:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:09:57,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:09:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:09:58,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:09:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:10:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:10:00,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:10:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:10:02,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:10:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:10:03,333][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:10:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:10:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:10:05,307][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:10:05,965][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:10:06,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:10:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:10:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:10:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:10:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:10:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:10:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:10:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:10:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:10:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:10:13,206][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:10:13,863][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:10:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:10:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:10:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:10:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:10:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:10:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:10:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:10:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:10:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:10:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:10:21,102][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:10:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:10:22,418][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:10:23,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:10:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:10:24,694][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:10:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:10:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:10:26,670][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:10:27,329][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:10:27,987][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:10:28,646][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:10:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:10:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:10:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:10:31,280][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:10:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:10:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:10:33,256][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:10:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:10:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:10:35,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:10:35,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:10:37,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:10:37,064][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:10:37,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:10:38,327][__main__][INFO] - Iteration 391 took 52s (10.11% Gen, 87.46% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 40m 16s. Estimated total time: 14h 29m 30s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 45s. [2026-03-25 20:10:38,330][__main__][INFO] - Starting iteration 391. [2026-03-25 20:10:38,333][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:10:38,334][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:10:43,431][__main__][INFO] - Number of regex retries in iteration 391: 0 [2026-03-25 20:10:43,432][__main__][INFO] - agents played in iteration 391 are Alice, Bob [2026-03-25 20:10:44,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:10:44,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:10:44,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:10:44,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:10:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:10:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:10:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:10:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:10:47,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:10:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:10:48,747][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:10:49,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:10:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:10:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:10:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:10:52,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:10:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:10:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:10:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:10:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:10:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:10:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:10:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:10:57,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:10:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:10:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:10:59,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:10:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:11:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:11:01,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:11:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:11:02,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:11:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:11:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:11:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:11:05,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:11:05,867][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:11:06,525][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:11:07,185][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:11:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:11:08,504][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:11:09,163][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:11:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:11:10,480][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:11:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:11:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:11:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:11:13,113][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:11:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:11:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:11:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:11:15,746][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:11:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:11:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:11:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:11:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:11:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:11:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:11:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:11:21,355][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:11:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:11:22,674][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:11:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:11:23,991][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:11:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:11:25,307][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:11:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:11:26,630][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:11:27,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:11:28,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:11:29,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:11:29,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:11:29,250][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:11:30,618][__main__][INFO] - Iteration 392 took 52s (9.75% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 41m 20s. Estimated total time: 14h 31m 26s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 43s. [2026-03-25 20:11:30,622][__main__][INFO] - Starting iteration 392. [2026-03-25 20:11:30,626][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:11:30,627][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:11:35,547][__main__][INFO] - Number of regex retries in iteration 392: 0 [2026-03-25 20:11:35,548][__main__][INFO] - agents played in iteration 392 are Alice, Bob [2026-03-25 20:11:36,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:11:36,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:11:36,132][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:11:36,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:11:36,800][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:11:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:11:38,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:11:38,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:11:39,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:11:40,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:11:40,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:11:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:11:42,016][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:11:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:11:43,334][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:11:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:11:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:11:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:11:45,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:11:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:11:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:11:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:11:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:11:49,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:11:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:11:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:11:51,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:11:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:11:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:11:53,228][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:11:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:11:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:11:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:11:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:11:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:11:57,185][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:11:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:11:58,505][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:11:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:11:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:12:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:12:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:12:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:12:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:12:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:12:03,785][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:12:04,445][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:12:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:12:05,766][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:12:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:12:07,085][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:12:07,746][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:12:08,702][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:12:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:12:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:12:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:12:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:12:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:12:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:12:13,320][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:12:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:12:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:12:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:12:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:12:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:12:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:12:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:12:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:12:19,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:12:19,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:12:21,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:12:21,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:12:21,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:12:22,453][__main__][INFO] - Iteration 393 took 51s (9.50% Gen, 87.95% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 32m 50s. Estimated total time: 14h 23m 48s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 22s, 500 more iterations: 7h 11m 54s. [2026-03-25 20:12:22,456][__main__][INFO] - Starting iteration 393. [2026-03-25 20:12:22,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:12:22,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:12:27,329][__main__][INFO] - Number of regex retries in iteration 393: 0 [2026-03-25 20:12:27,330][__main__][INFO] - agents played in iteration 393 are Alice, Bob [2026-03-25 20:12:27,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:27,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:28,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:12:28,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:12:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:12:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:12:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:12:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:12:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:12:31,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:12:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:12:33,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:12:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:12:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:12:35,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:12:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:12:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:12:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:12:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:12:38,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:12:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:12:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:12:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:12:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:12:41,827][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:12:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:12:43,146][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:12:43,805][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:12:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:12:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:12:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:12:46,442][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:12:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:12:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:12:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:12:49,082][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:12:49,742][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:12:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:12:51,061][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:12:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:12:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:12:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:12:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:12:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:12:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:12:55,682][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:12:56,342][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:12:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:12:57,660][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:12:58,318][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:12:58,978][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:12:59,635][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:13:00,536][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:13:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:13:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:13:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:13:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:13:03,828][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:13:04,487][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:13:05,145][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:13:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:13:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:13:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:13:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:13:08,439][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:13:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:13:09,757][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:13:10,415][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:13:11,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:13:11,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:13:12,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:13:12,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:13:12,937][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:13:14,351][__main__][INFO] - Iteration 394 took 51s (9.38% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 33m 2s. Estimated total time: 14h 24m 52s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 26s. [2026-03-25 20:13:14,353][__main__][INFO] - Starting iteration 394. [2026-03-25 20:13:14,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:13:14,360][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:13:19,325][__main__][INFO] - Number of regex retries in iteration 394: 0 [2026-03-25 20:13:19,326][__main__][INFO] - agents played in iteration 394 are Alice, Bob [2026-03-25 20:13:19,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:13:19,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:13:19,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:13:19,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:13:20,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:13:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:13:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:13:22,479][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:13:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:13:23,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:13:24,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:13:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:13:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:13:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:13:27,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:13:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:13:28,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:13:29,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:13:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:13:30,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:13:31,034][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:13:31,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:13:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:13:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:13:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:13:34,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:13:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:13:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:13:36,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:13:36,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:13:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:13:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:13:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:13:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:13:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:13:40,909][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:13:41,568][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:13:42,226][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:13:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:13:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:13:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:13:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:13:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:13:46,177][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:13:46,836][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:13:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:13:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:13:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:13:49,470][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:13:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:13:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:13:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:13:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:13:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:13:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:13:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:13:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:13:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:13:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:13:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:13:57,639][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:13:58,298][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:13:58,956][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:13:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:14:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:14:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:14:01,591][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:14:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:14:02,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:14:03,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:14:04,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:14:04,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:14:04,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:14:06,061][__main__][INFO] - Iteration 395 took 51s (9.60% Gen, 87.81% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 29m 1s. Estimated total time: 14h 21m 43s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 51s. [2026-03-25 20:14:06,069][__main__][INFO] - Starting iteration 395. [2026-03-25 20:14:06,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:14:06,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:14:11,435][__main__][INFO] - Number of regex retries in iteration 395: 0 [2026-03-25 20:14:11,437][__main__][INFO] - agents played in iteration 395 are Alice, Bob [2026-03-25 20:14:12,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:14:12,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:14:12,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:14:12,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:14:12,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:14:13,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:14:14,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:14:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:14:15,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:14:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:14:16,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:14:17,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:14:17,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:14:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:14:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:14:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:14:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:14:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:14:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:14:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:14:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:14:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:14:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:14:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:14:25,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:14:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:14:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:14:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:14:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:14:29,182][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:14:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:14:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:14:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:14:31,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:14:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:14:33,130][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:14:33,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:14:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:14:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:14:35,762][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:14:36,420][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:14:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:14:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:14:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:14:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:14:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:14:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:14:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:14:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:14:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:14:43,005][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:14:43,663][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:14:44,610][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:14:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:14:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:14:46,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:14:47,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:14:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:14:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:14:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:14:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:14:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:14:51,194][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:14:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:14:52,510][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:14:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:14:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:14:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:14:55,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:14:55,829][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:14:56,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:14:56,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:14:56,988][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:14:58,388][__main__][INFO] - Iteration 396 took 52s (10.24% Gen, 87.08% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 38m 16s. Estimated total time: 14h 31m 51s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 55s. [2026-03-25 20:14:58,390][__main__][INFO] - Starting iteration 396. [2026-03-25 20:14:58,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:14:58,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:15:04,085][__main__][INFO] - Number of regex retries in iteration 396: 0 [2026-03-25 20:15:04,087][__main__][INFO] - agents played in iteration 396 are Alice, Bob [2026-03-25 20:15:04,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:04,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:04,695][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:15:04,696][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:15:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:15:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:15:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:15:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:15:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:15:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:15:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:15:09,932][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:15:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:15:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:15:11,909][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:15:12,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:15:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:15:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:15:14,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:15:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:15:15,868][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:15:16,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:15:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:15:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:15:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:15:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:15:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:15:20,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:15:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:15:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:15:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:15:23,114][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:15:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:15:24,430][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:15:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:15:25,748][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:15:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:15:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:15:27,728][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:15:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:15:29,045][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:15:29,704][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:15:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:15:31,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:15:31,681][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:15:32,340][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:15:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:15:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:15:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:15:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:15:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:15:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:15:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:15:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:15:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:15:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:15:39,822][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:15:40,480][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:15:41,139][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:15:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:15:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:15:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:15:43,775][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:15:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:15:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:15:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:15:46,410][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:15:47,070][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:15:47,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:15:48,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:15:49,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:15:49,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:15:49,552][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:15:50,991][__main__][INFO] - Iteration 397 took 52s (10.82% Gen, 86.44% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 42m 11s. Estimated total time: 14h 36m 38s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 19s. [2026-03-25 20:15:50,993][__main__][INFO] - Starting iteration 397. [2026-03-25 20:15:50,998][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:15:50,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:15:58,941][__main__][INFO] - Number of regex retries in iteration 397: 0 [2026-03-25 20:15:58,943][__main__][INFO] - agents played in iteration 397 are Alice, Bob [2026-03-25 20:15:59,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:59,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:59,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:15:59,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:16:00,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:16:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:16:01,598][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:16:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:16:02,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:16:03,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:16:04,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:16:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:16:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:16:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:16:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:16:07,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:16:08,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:16:08,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:16:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:16:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:16:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:16:11,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:16:12,142][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:16:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:16:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:16:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:16:14,779][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:16:15,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:16:16,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:16:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:16:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:16:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:16:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:16:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:16:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:16:20,710][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:16:21,369][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:16:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:16:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:16:23,344][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:16:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:16:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:16:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:16:25,978][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:16:26,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:16:27,297][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:16:27,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:16:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:16:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:16:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:16:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:16:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:16:32,188][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:16:32,848][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:16:33,507][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:16:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:16:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:16:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:16:36,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:16:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:16:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:16:38,115][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:16:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:16:39,433][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:16:40,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:16:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:16:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:16:42,068][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:16:42,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:16:43,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:16:44,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:16:44,529][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:16:44,530][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:16:45,888][__main__][INFO] - Iteration 398 took 54s (14.47% Gen, 83.05% Train). Generation: 7s, Training: 45s. Estimated remaining time: 9h 19m 31s. Estimated total time: 15h 14m 53s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 26s. [2026-03-25 20:16:45,893][__main__][INFO] - Starting iteration 398. [2026-03-25 20:16:45,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:16:45,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:16:51,616][__main__][INFO] - Number of regex retries in iteration 398: 0 [2026-03-25 20:16:51,618][__main__][INFO] - agents played in iteration 398 are Alice, Bob [2026-03-25 20:16:52,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:16:52,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:16:52,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:16:52,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:16:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:16:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:16:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:16:54,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:16:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:16:56,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:16:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:16:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:16:58,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:16:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:16:59,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:17:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:17:00,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:17:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:17:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:17:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:17:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:17:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:17:04,764][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:17:05,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:17:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:17:06,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:17:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:17:08,057][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:17:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:17:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:17:10,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:17:10,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:17:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:17:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:17:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:17:13,325][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:17:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:17:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:17:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:17:15,960][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:17:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:17:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:17:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:17:18,594][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:17:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:17:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:17:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:17:21,227][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:17:21,886][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:17:22,544][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:17:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:17:23,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:17:24,761][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:17:25,421][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:17:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:17:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:17:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:17:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:17:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:17:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:17:30,030][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:17:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:17:31,346][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:17:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:17:32,663][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:17:33,322][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:17:33,980][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:17:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:17:35,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:17:36,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:17:37,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:17:37,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:17:37,891][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:17:39,190][__main__][INFO] - Iteration 399 took 53s (10.74% Gen, 86.82% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 52m 0s. Estimated total time: 14h 48m 15s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 49s, 500 more iterations: 7h 24m 7s. [2026-03-25 20:17:39,192][__main__][INFO] - Starting iteration 399. [2026-03-25 20:17:39,196][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:17:39,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:17:44,128][__main__][INFO] - Number of regex retries in iteration 399: 0 [2026-03-25 20:17:44,129][__main__][INFO] - agents played in iteration 399 are Alice, Bob [2026-03-25 20:17:44,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:17:44,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:17:44,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:17:44,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:17:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:17:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:17:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:17:47,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:17:47,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:17:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:17:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:17:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:17:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:17:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:17:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:17:52,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:17:53,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:17:53,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:17:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:17:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:17:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:17:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:17:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:17:57,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:17:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:17:59,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:17:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:18:00,463][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:18:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:18:01,780][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:18:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:18:03,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:18:03,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:18:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:18:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:18:05,732][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:18:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:18:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:18:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:18:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:18:09,025][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:18:09,685][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:18:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:18:11,002][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:18:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:18:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:18:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:18:13,635][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:18:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:18:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:18:15,611][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:18:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:18:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:18:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:18:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:18:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:18:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:18:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:18:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:18:21,811][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:18:22,470][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:18:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:18:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:18:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:18:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:18:25,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:18:26,427][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:18:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:18:27,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:18:28,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:18:29,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:18:29,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:18:29,606][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:18:31,046][__main__][INFO] - Iteration 400 took 51s (9.51% Gen, 87.71% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 27m 4s. Estimated total time: 14h 24m 11s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 5s. [2026-03-25 20:18:31,049][__main__][INFO] - Starting iteration 400. [2026-03-25 20:18:31,052][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:18:31,053][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:18:35,799][__main__][INFO] - Number of regex retries in iteration 400: 0 [2026-03-25 20:18:35,800][__main__][INFO] - agents played in iteration 400 are Alice, Bob [2026-03-25 20:18:36,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:18:36,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:18:36,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:18:36,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:18:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:18:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:18:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:18:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:18:39,755][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:18:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:18:41,071][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:18:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:18:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:18:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:18:43,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:18:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:18:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:18:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:18:46,333][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:18:46,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:18:47,650][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:18:48,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:18:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:18:49,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:18:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:18:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:18:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:18:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:18:52,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:18:53,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:18:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:18:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:18:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:18:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:18:56,871][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:18:57,531][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:18:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:18:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:18:59,508][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:19:00,167][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:19:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:19:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:19:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:19:02,804][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:19:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:19:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:19:04,782][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:19:05,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:19:06,100][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:19:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:19:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:19:08,077][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:19:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:19:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:19:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:19:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:19:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:19:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:19:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:19:13,597][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:19:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:19:14,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:19:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:19:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:19:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:19:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:19:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:19:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:19:19,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:19:20,366][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:19:21,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:19:21,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:19:21,464][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:19:24,197][__main__][INFO] - Iteration 401 took 53s (8.93% Gen, 85.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 47m 46s. Estimated total time: 14h 45m 46s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 53s. [2026-03-25 20:19:24,200][__main__][INFO] - Starting iteration 401. [2026-03-25 20:19:24,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:19:24,206][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:19:30,231][__main__][INFO] - Number of regex retries in iteration 401: 0 [2026-03-25 20:19:30,232][__main__][INFO] - agents played in iteration 401 are Alice, Bob [2026-03-25 20:19:30,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:19:30,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:19:30,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:19:30,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:19:31,516][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:19:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:19:32,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:19:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:19:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:19:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:19:35,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:19:36,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:19:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:19:37,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:19:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:19:38,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:19:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:19:40,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:19:40,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:19:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:19:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:19:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:19:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:19:43,990][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:19:44,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:19:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:19:45,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:19:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:19:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:19:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:19:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:19:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:19:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:19:50,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:19:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:19:51,908][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:19:52,567][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:19:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:19:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:19:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:19:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:19:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:19:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:19:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:19:57,842][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:19:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:19:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:19:59,820][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:20:00,479][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:20:01,137][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:20:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:20:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:20:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:20:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:20:04,766][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:20:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:20:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:20:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:20:07,401][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:20:08,059][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:20:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:20:09,377][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:20:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:20:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:20:11,353][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:20:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:20:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:20:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:20:13,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:20:14,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:20:15,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:20:15,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:20:15,789][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:20:17,176][__main__][INFO] - Iteration 402 took 52s (11.38% Gen, 86.00% Train). Generation: 6s, Training: 45s. Estimated remaining time: 8h 44m 1s. Estimated total time: 14h 42m 54s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 27s. [2026-03-25 20:20:17,178][__main__][INFO] - Starting iteration 402. [2026-03-25 20:20:17,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:20:17,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:20:27,709][__main__][INFO] - Number of regex retries in iteration 402: 0 [2026-03-25 20:20:27,711][__main__][INFO] - agents played in iteration 402 are Alice, Bob [2026-03-25 20:20:28,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:20:28,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:20:28,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:20:28,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:20:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:20:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:20:30,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:20:30,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:20:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:20:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:20:32,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:20:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:20:34,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:20:34,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:20:35,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:20:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:20:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:20:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:20:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:20:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:20:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:20:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:20:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:20:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:20:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:20:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:20:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:20:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:20:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:20:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:20:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:20:46,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:20:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:20:47,999][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:20:48,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:20:49,315][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:20:49,975][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:20:50,633][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:20:51,291][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:20:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:20:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:20:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:20:53,926][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:20:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:20:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:20:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:20:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:20:57,223][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:20:57,882][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:20:58,541][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:20:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:20:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:21:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:21:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:21:02,096][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:21:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:21:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:21:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:21:04,732][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:21:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:21:06,049][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:21:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:21:07,366][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:21:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:21:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:21:09,343][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:21:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:21:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:21:11,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:21:12,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:21:13,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:21:13,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:21:13,125][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:21:14,440][__main__][INFO] - Iteration 403 took 57s (18.39% Gen, 79.31% Train). Generation: 10s, Training: 45s. Estimated remaining time: 9h 54m 29s. Estimated total time: 15h 54m 19s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 25s, 500 more iterations: 7h 57m 9s. [2026-03-25 20:21:14,443][__main__][INFO] - Starting iteration 403. [2026-03-25 20:21:14,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:21:14,449][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:21:19,937][__main__][INFO] - Number of regex retries in iteration 403: 0 [2026-03-25 20:21:19,938][__main__][INFO] - agents played in iteration 403 are Alice, Bob [2026-03-25 20:21:20,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:21:20,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:21:20,565][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:21:20,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:21:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:21:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:21:22,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:21:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:21:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:21:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:21:25,146][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:21:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:21:26,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:21:27,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:21:27,783][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:21:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:21:29,100][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:21:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:21:30,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:21:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:21:31,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:21:32,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:21:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:21:33,709][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:21:34,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:21:35,026][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:21:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:21:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:21:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:21:37,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:21:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:21:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:21:39,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:21:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:21:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:21:41,612][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:21:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:21:42,929][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:21:43,587][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:21:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:21:44,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:21:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:21:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:21:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:21:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:21:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:21:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:21:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:21:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:21:50,835][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:21:51,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:21:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:21:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:21:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:21:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:21:55,051][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:21:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:21:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:21:57,029][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:21:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:21:58,346][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:21:59,005][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:21:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:22:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:22:00,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:22:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:22:02,301][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:22:02,960][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:22:03,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:22:04,464][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:22:05,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:22:05,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:22:05,803][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:22:07,211][__main__][INFO] - Iteration 404 took 52s (10.40% Gen, 86.92% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 38m 41s. Estimated total time: 14h 39m 24s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 42s. [2026-03-25 20:22:07,213][__main__][INFO] - Starting iteration 404. [2026-03-25 20:22:07,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:22:07,217][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:22:12,093][__main__][INFO] - Number of regex retries in iteration 404: 0 [2026-03-25 20:22:12,094][__main__][INFO] - agents played in iteration 404 are Alice, Bob [2026-03-25 20:22:12,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:22:12,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:22:12,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:22:12,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:22:13,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:22:14,013][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:22:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:22:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:22:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:22:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:22:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:22:17,978][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:22:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:22:19,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:22:19,960][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:22:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:22:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:22:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:22:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:22:23,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:22:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:22:24,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:22:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:22:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:22:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:22:27,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:22:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:22:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:22:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:22:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:22:30,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:22:31,190][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:22:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:22:32,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:22:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:22:33,828][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:22:34,488][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:22:35,148][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:22:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:22:36,469][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:22:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:22:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:22:38,452][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:22:39,111][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:22:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:22:40,433][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:22:41,091][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:22:41,752][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:22:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:22:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:22:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:22:44,393][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:22:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:22:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:22:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:22:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:22:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:22:48,657][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:22:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:22:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:22:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:22:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:22:51,958][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:22:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:22:53,279][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:22:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:22:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:22:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:22:55,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:22:56,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:22:57,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:22:57,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:22:57,932][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:22:59,384][__main__][INFO] - Iteration 405 took 52s (9.31% Gen, 87.86% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 27m 54s. Estimated total time: 14h 29m 29s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 44s. [2026-03-25 20:22:59,387][__main__][INFO] - Starting iteration 405. [2026-03-25 20:22:59,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:22:59,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:23:10,442][__main__][INFO] - Number of regex retries in iteration 405: 0 [2026-03-25 20:23:10,442][__main__][INFO] - agents played in iteration 405 are Alice, Bob [2026-03-25 20:23:11,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:23:11,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:23:11,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:23:11,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:23:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:23:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:23:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:23:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:23:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:23:15,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:23:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:23:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:23:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:23:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:23:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:23:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:23:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:23:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:23:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:23:21,683][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:23:22,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:23:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:23:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:23:24,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:23:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:23:25,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:23:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:23:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:23:27,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:23:28,285][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:23:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:23:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:23:30,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:23:30,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:23:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:23:32,245][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:23:32,905][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:23:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:23:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:23:34,886][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:23:35,546][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:23:36,206][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:23:36,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:23:37,524][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:23:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:23:38,845][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:23:39,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:23:40,163][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:23:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:23:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:23:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:23:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:23:43,705][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:23:44,365][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:23:45,025][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:23:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:23:46,344][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:23:47,003][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:23:47,663][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:23:48,324][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:23:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:23:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:23:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:23:50,962][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:23:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:23:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:23:52,944][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:23:53,605][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:23:54,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:23:55,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:23:56,224][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:23:56,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:23:56,228][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:23:57,463][__main__][INFO] - Iteration 406 took 58s (19.03% Gen, 78.84% Train). Generation: 11s, Training: 45s. Estimated remaining time: 10h 5m 20s. Estimated total time: 16h 7m 54s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 47s, 500 more iterations: 8h 3m 57s. [2026-03-25 20:23:57,465][__main__][INFO] - Starting iteration 406. [2026-03-25 20:23:57,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:23:57,470][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:24:02,430][__main__][INFO] - Number of regex retries in iteration 406: 0 [2026-03-25 20:24:02,433][__main__][INFO] - agents played in iteration 406 are Alice, Bob [2026-03-25 20:24:03,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:03,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:03,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:24:03,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:24:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:24:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:24:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:24:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:24:06,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:24:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:24:07,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:24:08,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:24:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:24:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:24:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:24:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:24:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:24:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:24:12,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:24:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:24:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:24:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:24:15,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:24:16,270][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:24:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:24:17,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:24:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:24:18,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:24:19,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:24:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:24:20,884][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:24:21,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:24:22,203][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:24:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:24:23,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:24:24,181][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:24:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:24:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:24:26,162][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:24:26,821][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:24:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:24:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:24:28,797][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:24:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:24:30,114][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:24:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:24:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:24:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:24:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:24:33,406][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:24:34,065][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:24:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:24:35,629][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:24:36,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:24:36,944][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:24:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:24:38,263][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:24:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:24:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:24:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:24:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:24:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:24:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:24:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:24:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:24:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:24:44,853][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:24:45,512][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:24:46,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:24:46,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:24:48,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:24:48,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:24:52,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:24:53,528][__main__][INFO] - Iteration 407 took 56s (8.85% Gen, 88.66% Train). Generation: 4s, Training: 49s. Estimated remaining time: 9h 30m 51s. Estimated total time: 15h 34m 20s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 10s. [2026-03-25 20:24:53,530][__main__][INFO] - Starting iteration 407. [2026-03-25 20:24:53,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:24:53,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:24:58,638][__main__][INFO] - Number of regex retries in iteration 407: 0 [2026-03-25 20:24:58,639][__main__][INFO] - agents played in iteration 407 are Alice, Bob [2026-03-25 20:24:59,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:59,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:59,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:24:59,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:24:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:25:00,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:25:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:25:01,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:25:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:25:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:25:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:25:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:25:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:25:05,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:25:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:25:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:25:07,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:25:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:25:09,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:25:09,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:25:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:25:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:25:11,795][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:25:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:25:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:25:13,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:25:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:25:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:25:15,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:25:16,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:25:17,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:25:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:25:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:25:19,037][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:25:19,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:25:20,353][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:25:21,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:25:21,669][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:25:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:25:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:25:23,643][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:25:24,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:25:24,960][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:25:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:25:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:25:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:25:27,596][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:25:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:25:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:25:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:25:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:25:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:25:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:25:32,487][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:25:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:25:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:25:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:25:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:25:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:25:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:25:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:25:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:25:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:25:39,076][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:25:39,735][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:25:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:25:41,053][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:25:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:25:42,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:25:43,117][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:25:44,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:25:44,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:25:44,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:25:45,413][__main__][INFO] - Iteration 408 took 51s (9.84% Gen, 87.83% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 20m 19s. Estimated total time: 14h 24m 40s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 20s. [2026-03-25 20:25:45,415][__main__][INFO] - Starting iteration 408. [2026-03-25 20:25:45,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:25:45,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:25:50,517][__main__][INFO] - Number of regex retries in iteration 408: 0 [2026-03-25 20:25:50,518][__main__][INFO] - agents played in iteration 408 are Alice, Bob [2026-03-25 20:25:51,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:51,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:51,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:25:51,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:25:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:25:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:25:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:25:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:25:54,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:25:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:25:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:25:56,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:25:56,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:25:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:25:58,289][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:25:58,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:25:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:26:00,262][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:26:00,919][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:26:01,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:26:02,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:26:02,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:26:03,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:26:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:26:04,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:26:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:26:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:26:06,839][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:26:07,498][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:26:08,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:26:08,814][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:26:09,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:26:10,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:26:10,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:26:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:26:12,103][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:26:12,761][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:26:13,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:26:14,077][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:26:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:26:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:26:16,052][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:26:16,710][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:26:17,368][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:26:18,026][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:26:18,685][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:26:19,342][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:26:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:26:20,658][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:26:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:26:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:26:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:26:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:26:24,480][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:26:25,140][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:26:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:26:26,461][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:26:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:26:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:26:28,543][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:26:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:26:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:26:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:26:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:26:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:26:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:26:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:26:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:26:34,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:26:35,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:26:36,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:26:36,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:26:36,515][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:26:37,854][__main__][INFO] - Iteration 409 took 52s (9.72% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 28m 40s. Estimated total time: 14h 33m 54s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 57s. [2026-03-25 20:26:37,859][__main__][INFO] - Starting iteration 409. [2026-03-25 20:26:37,884][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:26:37,884][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:26:42,707][__main__][INFO] - Number of regex retries in iteration 409: 0 [2026-03-25 20:26:42,708][__main__][INFO] - agents played in iteration 409 are Alice, Bob [2026-03-25 20:26:43,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:26:43,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:26:43,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:26:43,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:26:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:26:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:26:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:26:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:26:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:26:47,490][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:26:48,148][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:26:48,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:26:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:26:50,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:26:50,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:26:51,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:26:52,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:26:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:26:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:26:54,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:26:54,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:26:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:26:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:26:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:26:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:26:58,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:26:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:26:59,330][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:26:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:27:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:27:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:27:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:27:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:27:03,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:27:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:27:04,593][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:27:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:27:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:27:06,566][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:27:07,224][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:27:07,882][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:27:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:27:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:27:09,856][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:27:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:27:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:27:11,829][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:27:12,487][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:27:13,144][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:27:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:27:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:27:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:27:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:27:16,689][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:27:17,348][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:27:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:27:18,664][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:27:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:27:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:27:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:27:21,294][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:27:21,952][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:27:22,610][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:27:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:27:23,926][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:27:24,584][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:27:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:27:25,901][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:27:26,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:27:27,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:27:28,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:27:28,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:27:28,402][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:27:29,765][__main__][INFO] - Iteration 410 took 51s (9.30% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 18m 37s. Estimated total time: 14h 24m 42s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 21s. [2026-03-25 20:27:29,768][__main__][INFO] - Starting iteration 410. [2026-03-25 20:27:29,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:27:29,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:27:48,056][__main__][INFO] - Number of regex retries in iteration 410: 0 [2026-03-25 20:27:48,057][__main__][INFO] - agents played in iteration 410 are Alice, Bob [2026-03-25 20:27:48,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:27:48,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:27:48,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:27:48,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:27:49,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:27:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:27:50,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:27:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:27:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:27:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:27:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:27:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:27:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:27:55,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:27:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:27:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:27:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:27:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:27:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:27:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:27:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:28:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:28:01,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:28:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:28:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:28:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:28:03,740][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:28:04,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:28:05,054][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:28:05,711][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:28:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:28:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:28:07,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:28:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:28:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:28:09,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:28:10,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:28:10,971][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:28:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:28:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:28:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:28:13,601][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:28:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:28:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:28:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:28:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:28:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:28:17,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:28:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:28:18,862][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:28:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:28:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:28:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:28:21,782][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:28:22,439][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:28:23,097][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:28:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:28:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:28:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:28:25,728][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:28:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:28:27,044][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:28:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:28:28,359][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:28:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:28:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:28:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:28:30,990][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:28:31,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:28:32,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:28:33,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:28:33,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:28:33,457][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:28:34,996][__main__][INFO] - Iteration 411 took 1m 5s (28.03% Gen, 69.60% Train). Generation: 18s, Training: 45s. Estimated remaining time: 11h 59m 56s. Estimated total time: 18h 7m 6s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 42s, 500 more iterations: 9h 3m 33s. [2026-03-25 20:28:34,999][__main__][INFO] - Starting iteration 411. [2026-03-25 20:28:35,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:28:35,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:28:40,178][__main__][INFO] - Number of regex retries in iteration 411: 0 [2026-03-25 20:28:40,180][__main__][INFO] - agents played in iteration 411 are Alice, Bob [2026-03-25 20:28:40,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:28:40,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:28:40,884][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:28:40,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:28:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:28:42,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:28:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:28:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:28:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:28:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:28:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:28:46,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:28:46,799][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:28:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:28:48,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:28:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:28:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:28:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:28:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:28:51,420][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:28:52,078][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:28:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:28:53,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:28:54,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:28:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:28:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:28:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:28:56,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:28:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:28:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:28:58,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:28:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:28:59,983][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:29:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:29:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:29:01,958][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:29:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:29:03,275][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:29:03,933][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:29:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:29:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:29:05,909][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:29:06,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:29:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:29:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:29:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:29:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:29:09,858][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:29:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:29:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:29:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:29:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:29:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:29:14,148][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:29:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:29:15,465][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:29:16,126][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:29:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:29:17,443][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:29:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:29:18,764][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:29:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:29:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:29:20,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:29:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:29:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:29:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:29:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:29:24,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:29:24,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:29:25,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:29:25,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:29:25,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:29:27,359][__main__][INFO] - Iteration 412 took 52s (9.89% Gen, 87.42% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 24m 34s. Estimated total time: 14h 32m 37s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 18s. [2026-03-25 20:29:27,361][__main__][INFO] - Starting iteration 412. [2026-03-25 20:29:27,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:29:27,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:29:32,147][__main__][INFO] - Number of regex retries in iteration 412: 0 [2026-03-25 20:29:32,148][__main__][INFO] - agents played in iteration 412 are Alice, Bob [2026-03-25 20:29:32,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:29:32,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:29:32,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:29:32,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:29:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:29:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:29:34,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:29:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:29:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:29:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:29:37,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:29:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:29:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:29:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:29:40,090][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:29:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:29:41,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:29:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:29:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:29:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:29:44,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:29:44,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:29:45,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:29:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:29:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:29:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:29:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:29:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:29:49,314][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:29:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:29:50,631][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:29:51,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:29:51,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:29:52,605][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:29:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:29:53,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:29:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:29:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:29:55,897][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:29:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:29:57,214][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:29:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:29:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:29:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:29:59,847][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:30:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:30:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:30:01,824][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:30:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:30:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:30:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:30:04,457][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:30:05,457][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:30:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:30:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:30:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:30:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:30:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:30:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:30:10,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:30:10,728][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:30:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:30:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:30:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:30:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:30:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:30:14,679][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:30:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:30:15,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:30:16,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:30:17,843][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:30:17,846][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:30:17,849][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:30:19,179][__main__][INFO] - Iteration 413 took 51s (9.23% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 14m 41s. Estimated total time: 14h 23m 36s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 48s. [2026-03-25 20:30:19,181][__main__][INFO] - Starting iteration 413. [2026-03-25 20:30:19,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:30:19,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:30:24,237][__main__][INFO] - Number of regex retries in iteration 413: 0 [2026-03-25 20:30:24,238][__main__][INFO] - agents played in iteration 413 are Alice, Bob [2026-03-25 20:30:24,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:30:24,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:30:24,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:30:24,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:30:25,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:30:26,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:30:26,909][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:30:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:30:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:30:28,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:30:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:30:30,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:30:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:30:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:30:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:30:32,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:30:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:30:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:30:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:30:35,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:30:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:30:36,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:30:37,453][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:30:38,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:30:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:30:39,431][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:30:40,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:30:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:30:41,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:30:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:30:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:30:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:30:44,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:30:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:30:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:30:46,029][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:30:46,688][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:30:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:30:48,005][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:30:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:30:49,322][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:30:49,981][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:30:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:30:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:30:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:30:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:30:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:30:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:30:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:30:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:30:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:30:56,569][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:30:57,539][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:30:58,199][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:30:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:30:59,516][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:31:00,174][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:31:00,832][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:31:01,490][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:31:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:31:02,808][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:31:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:31:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:31:04,784][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:31:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:31:06,101][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:31:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:31:07,421][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:31:08,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:31:08,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:31:10,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:31:10,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:31:10,021][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:31:11,505][__main__][INFO] - Iteration 414 took 52s (9.66% Gen, 87.50% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 22m 14s. Estimated total time: 14h 32m 2s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 1s. [2026-03-25 20:31:11,512][__main__][INFO] - Starting iteration 414. [2026-03-25 20:31:11,531][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:31:11,532][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:31:16,249][__main__][INFO] - Number of regex retries in iteration 414: 0 [2026-03-25 20:31:16,250][__main__][INFO] - agents played in iteration 414 are Alice, Bob [2026-03-25 20:31:16,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:31:16,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:31:16,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:31:16,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:31:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:31:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:31:18,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:31:19,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:31:20,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:31:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:31:21,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:31:22,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:31:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:31:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:31:24,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:31:24,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:31:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:31:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:31:26,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:31:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:31:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:31:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:31:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:31:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:31:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:31:31,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:31:32,010][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:31:32,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:31:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:31:33,984][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:31:34,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:31:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:31:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:31:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:31:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:31:37,933][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:31:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:31:39,251][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:31:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:31:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:31:41,230][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:31:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:31:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:31:43,208][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:31:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:31:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:31:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:31:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:31:46,506][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:31:47,165][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:31:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:31:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:31:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:31:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:31:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:31:51,357][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:31:52,016][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:31:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:31:53,336][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:31:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:31:54,654][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:31:55,314][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:31:55,978][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:31:56,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:31:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:31:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:31:58,616][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:31:59,275][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:31:59,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:32:00,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:32:01,867][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:32:01,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:32:01,872][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:32:03,239][__main__][INFO] - Iteration 415 took 51s (9.12% Gen, 88.23% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 11m 11s. Estimated total time: 14h 21m 50s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 55s. [2026-03-25 20:32:03,242][__main__][INFO] - Starting iteration 415. [2026-03-25 20:32:03,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:32:03,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:32:09,277][__main__][INFO] - Number of regex retries in iteration 415: 0 [2026-03-25 20:32:09,278][__main__][INFO] - agents played in iteration 415 are Alice, Bob [2026-03-25 20:32:09,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:32:09,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:32:09,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:32:09,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:32:10,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:32:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:32:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:32:12,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:32:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:32:13,873][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:32:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:32:15,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:32:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:32:16,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:32:17,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:32:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:32:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:32:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:32:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:32:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:32:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:32:21,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:32:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:32:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:32:23,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:32:24,420][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:32:25,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:32:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:32:26,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:32:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:32:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:32:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:32:29,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:32:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:32:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:32:31,015][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:32:31,675][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:32:32,334][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:32:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:32:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:32:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:32:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:32:35,631][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:32:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:32:36,950][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:32:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:32:38,270][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:32:38,929][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:32:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:32:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:32:40,908][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:32:41,568][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:32:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:32:43,160][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:32:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:32:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:32:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:32:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:32:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:32:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:32:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:32:48,436][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:32:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:32:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:32:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:32:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:32:51,734][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:32:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:32:53,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:32:53,948][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:32:55,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:32:55,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:32:55,078][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:32:56,371][__main__][INFO] - Iteration 416 took 53s (11.35% Gen, 86.21% Train). Generation: 6s, Training: 45s. Estimated remaining time: 8h 33m 54s. Estimated total time: 14h 45m 26s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 32s, 500 more iterations: 7h 22m 43s. [2026-03-25 20:32:56,373][__main__][INFO] - Starting iteration 416. [2026-03-25 20:32:56,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:32:56,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:33:01,149][__main__][INFO] - Number of regex retries in iteration 416: 0 [2026-03-25 20:33:01,150][__main__][INFO] - agents played in iteration 416 are Alice, Bob [2026-03-25 20:33:01,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:01,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:01,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:33:01,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:33:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:33:03,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:33:03,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:33:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:33:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:33:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:33:06,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:33:07,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:33:07,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:33:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:33:09,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:33:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:33:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:33:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:33:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:33:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:33:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:33:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:33:14,338][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:33:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:33:15,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:33:16,315][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:33:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:33:17,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:33:18,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:33:18,953][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:33:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:33:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:33:20,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:33:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:33:22,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:33:22,907][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:33:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:33:24,225][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:33:24,884][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:33:25,543][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:33:26,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:33:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:33:27,524][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:33:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:33:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:33:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:33:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:33:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:33:31,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:33:32,137][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:33:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:33:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:33:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:33:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:33:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:33:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:33:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:33:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:33:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:33:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:33:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:33:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:33:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:33:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:33:42,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:33:42,975][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:33:43,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:33:44,292][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:33:44,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:33:45,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:33:46,816][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:33:46,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:33:46,822][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:33:48,282][__main__][INFO] - Iteration 417 took 51s (9.20% Gen, 87.99% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 12m 43s. Estimated total time: 14h 25m 7s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 33s. [2026-03-25 20:33:48,285][__main__][INFO] - Starting iteration 417. [2026-03-25 20:33:48,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:33:48,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:33:53,286][__main__][INFO] - Number of regex retries in iteration 417: 0 [2026-03-25 20:33:53,288][__main__][INFO] - agents played in iteration 417 are Alice, Bob [2026-03-25 20:33:53,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:53,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:53,841][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:33:53,841][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:33:54,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:33:55,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:33:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:33:56,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:33:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:33:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:33:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:33:59,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:33:59,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:34:00,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:34:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:34:01,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:34:02,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:34:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:34:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:34:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:34:05,010][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:34:05,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:34:06,328][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:34:06,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:34:07,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:34:08,305][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:34:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:34:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:34:10,282][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:34:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:34:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:34:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:34:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:34:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:34:14,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:34:14,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:34:15,554][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:34:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:34:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:34:17,530][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:34:18,190][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:34:18,849][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:34:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:34:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:34:20,826][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:34:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:34:22,144][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:34:22,803][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:34:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:34:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:34:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:34:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:34:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:34:27,019][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:34:27,677][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:34:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:34:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:34:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:34:30,313][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:34:30,972][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:34:31,631][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:34:32,290][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:34:32,949][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:34:33,608][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:34:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:34:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:34:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:34:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:34:36,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:34:37,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:34:38,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:34:38,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:34:38,702][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:34:39,961][__main__][INFO] - Iteration 418 took 51s (9.67% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 7m 58s. Estimated total time: 14h 21m 14s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 7s, 500 more iterations: 7h 10m 37s. [2026-03-25 20:34:39,964][__main__][INFO] - Starting iteration 418. [2026-03-25 20:34:39,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:34:39,969][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:34:44,947][__main__][INFO] - Number of regex retries in iteration 418: 0 [2026-03-25 20:34:44,949][__main__][INFO] - agents played in iteration 418 are Alice, Bob [2026-03-25 20:34:45,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:34:45,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:34:45,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:34:45,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:34:46,219][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:34:46,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:34:47,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:34:48,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:34:48,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:34:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:34:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:34:50,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:34:51,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:34:52,116][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:34:52,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:34:53,433][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:34:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:34:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:34:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:34:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:34:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:34:57,387][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:34:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:34:58,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:34:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:35:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:35:00,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:35:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:35:01,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:35:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:35:03,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:35:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:35:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:35:05,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:35:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:35:06,611][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:35:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:35:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:35:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:35:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:35:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:35:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:35:11,224][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:35:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:35:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:35:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:35:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:35:14,519][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:35:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:35:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:35:16,497][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:35:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:35:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:35:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:35:19,428][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:35:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:35:20,746][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:35:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:35:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:35:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:35:23,381][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:35:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:35:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:35:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:35:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:35:26,680][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:35:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:35:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:35:28,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:35:29,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:35:30,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:35:30,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:35:30,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:35:31,952][__main__][INFO] - Iteration 419 took 51s (9.58% Gen, 87.90% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 12m 18s. Estimated total time: 14h 26m 26s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 13s. [2026-03-25 20:35:31,955][__main__][INFO] - Starting iteration 419. [2026-03-25 20:35:31,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:35:31,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:35:36,705][__main__][INFO] - Number of regex retries in iteration 419: 0 [2026-03-25 20:35:36,706][__main__][INFO] - agents played in iteration 419 are Alice, Bob [2026-03-25 20:35:37,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:37,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:37,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:35:37,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:35:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:35:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:35:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:35:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:35:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:35:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:35:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:35:42,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:35:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:35:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:35:44,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:35:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:35:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:35:46,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:35:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:35:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:35:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:35:49,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:35:49,764][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:35:50,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:35:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:35:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:35:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:35:53,059][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:35:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:35:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:35:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:35:55,696][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:35:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:35:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:35:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:35:58,334][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:35:58,993][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:35:59,652][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:36:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:36:00,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:36:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:36:02,288][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:36:02,947][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:36:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:36:04,269][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:36:04,930][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:36:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:36:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:36:06,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:36:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:36:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:36:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:36:09,834][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:36:10,496][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:36:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:36:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:36:12,480][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:36:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:36:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:36:14,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:36:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:36:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:36:16,445][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:36:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:36:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:36:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:36:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:36:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:36:20,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:36:21,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:36:22,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:36:22,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:36:22,251][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:36:23,713][__main__][INFO] - Iteration 420 took 51s (9.17% Gen, 88.00% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 7m 36s. Estimated total time: 14h 22m 36s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 15s, 500 more iterations: 7h 11m 18s. [2026-03-25 20:36:23,716][__main__][INFO] - Starting iteration 420. [2026-03-25 20:36:23,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:36:23,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:36:29,322][__main__][INFO] - Number of regex retries in iteration 420: 0 [2026-03-25 20:36:29,324][__main__][INFO] - agents played in iteration 420 are Alice, Bob [2026-03-25 20:36:29,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:36:29,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:36:29,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:36:29,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:36:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:36:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:36:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:36:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:36:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:36:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:36:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:36:35,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:36:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:36:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:36:37,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:36:37,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:36:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:36:39,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:36:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:36:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:36:41,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:36:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:36:42,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:36:43,123][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:36:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:36:44,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:36:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:36:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:36:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:36:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:36:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:36:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:36:49,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:36:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:36:50,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:36:51,035][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:36:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:36:52,354][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:36:53,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:36:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:36:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:36:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:36:55,653][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:36:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:36:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:36:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:36:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:36:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:36:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:37:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:37:00,927][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:37:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:37:02,534][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:37:03,193][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:37:03,855][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:37:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:37:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:37:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:37:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:37:07,150][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:37:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:37:08,468][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:37:09,127][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:37:09,786][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:37:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:37:11,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:37:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:37:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:37:13,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:37:13,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:37:14,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:37:14,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:37:14,947][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:37:16,504][__main__][INFO] - Iteration 421 took 52s (10.62% Gen, 86.43% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 23m 53s. Estimated total time: 14h 39m 45s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 58s, 500 more iterations: 7h 19m 52s. [2026-03-25 20:37:16,506][__main__][INFO] - Starting iteration 421. [2026-03-25 20:37:16,510][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:37:16,510][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:37:21,331][__main__][INFO] - Number of regex retries in iteration 421: 0 [2026-03-25 20:37:21,332][__main__][INFO] - agents played in iteration 421 are Alice, Bob [2026-03-25 20:37:21,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:37:21,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:37:21,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:37:21,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:37:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:37:23,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:37:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:37:24,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:37:25,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:37:25,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:37:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:37:27,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:37:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:37:28,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:37:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:37:29,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:37:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:37:31,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:37:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:37:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:37:33,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:37:33,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:37:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:37:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:37:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:37:36,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:37:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:37:37,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:37:38,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:37:39,131][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:37:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:37:40,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:37:41,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:37:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:37:42,428][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:37:43,087][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:37:43,746][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:37:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:37:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:37:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:37:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:37:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:37:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:37:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:37:49,017][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:37:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:37:50,336][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:37:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:37:51,654][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:37:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:37:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:37:53,630][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:37:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:37:55,252][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:37:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:37:56,572][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:37:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:37:57,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:37:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:37:59,208][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:37:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:38:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:38:01,184][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:38:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:38:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:38:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:38:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:38:04,477][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:38:05,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:38:05,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:38:07,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:38:07,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:38:07,149][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:38:08,482][__main__][INFO] - Iteration 422 took 51s (9.28% Gen, 88.15% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 9m 29s. Estimated total time: 14h 26m 14s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 7s. [2026-03-25 20:38:08,484][__main__][INFO] - Starting iteration 422. [2026-03-25 20:38:08,490][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:38:08,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:38:13,964][__main__][INFO] - Number of regex retries in iteration 422: 0 [2026-03-25 20:38:13,965][__main__][INFO] - agents played in iteration 422 are Alice, Bob [2026-03-25 20:38:14,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:38:14,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:38:14,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:38:14,694][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:38:15,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:38:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:38:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:38:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:38:18,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:38:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:38:19,374][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:38:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:38:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:38:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:38:22,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:38:22,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:38:23,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:38:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:38:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:38:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:38:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:38:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:38:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:38:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:38:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:38:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:38:29,910][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:38:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:38:31,228][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:38:31,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:38:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:38:33,203][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:38:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:38:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:38:35,179][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:38:35,837][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:38:36,496][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:38:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:38:37,813][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:38:38,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:38:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:38:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:38:40,449][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:38:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:38:41,770][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:38:42,428][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:38:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:38:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:38:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:38:45,062][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:38:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:38:46,379][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:38:47,302][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:38:47,961][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:38:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:38:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:38:49,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:38:50,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:38:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:38:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:38:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:38:53,235][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:38:53,894][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:38:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:38:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:38:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:38:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:38:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:38:57,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:38:58,636][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:38:59,756][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:38:59,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:38:59,761][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:39:01,045][__main__][INFO] - Iteration 423 took 52s (10.42% Gen, 87.13% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 18m 20s. Estimated total time: 14h 35m 57s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 58s. [2026-03-25 20:39:01,048][__main__][INFO] - Starting iteration 423. [2026-03-25 20:39:01,052][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:39:01,053][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:39:06,174][__main__][INFO] - Number of regex retries in iteration 423: 0 [2026-03-25 20:39:06,175][__main__][INFO] - agents played in iteration 423 are Alice, Bob [2026-03-25 20:39:06,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:39:06,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:39:06,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:39:06,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:39:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:39:08,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:39:08,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:39:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:39:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:39:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:39:11,417][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:39:12,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:39:12,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:39:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:39:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:39:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:39:15,373][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:39:16,032][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:39:16,690][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:39:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:39:18,007][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:39:18,666][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:39:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:39:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:39:20,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:39:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:39:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:39:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:39:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:39:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:39:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:39:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:39:25,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:39:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:39:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:39:27,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:39:28,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:39:29,221][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:39:29,881][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:39:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:39:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:39:31,859][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:39:32,518][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:39:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:39:33,834][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:39:34,493][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:39:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:39:35,811][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:39:36,471][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:39:37,130][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:39:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:39:38,451][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:39:39,403][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:39:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:39:40,725][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:39:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:39:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:39:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:39:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:39:44,023][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:39:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:39:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:39:46,002][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:39:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:39:47,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:39:47,981][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:39:48,640][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:39:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:39:49,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:39:50,643][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:39:51,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:39:51,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:39:51,793][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:40:00,161][__main__][INFO] - Iteration 424 took 59s (8.66% Gen, 77.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 6m 34s. Estimated total time: 16h 25m 10s. Time estimates for 10 more iterations: 9m 51s, 100 more iterations: 1h 38m 31s, 500 more iterations: 8h 12m 35s. [2026-03-25 20:40:00,164][__main__][INFO] - Starting iteration 424. [2026-03-25 20:40:00,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:40:00,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:40:09,086][__main__][INFO] - Number of regex retries in iteration 424: 0 [2026-03-25 20:40:09,087][__main__][INFO] - agents played in iteration 424 are Alice, Bob [2026-03-25 20:40:09,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:40:09,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:40:09,756][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:40:09,756][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:40:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:40:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:40:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:40:12,343][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:40:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:40:13,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:40:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:40:14,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:40:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:40:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:40:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:40:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:40:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:40:18,927][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:40:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:40:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:40:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:40:21,559][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:40:22,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:40:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:40:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:40:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:40:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:40:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:40:26,166][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:40:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:40:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:40:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:40:28,799][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:40:29,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:40:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:40:30,774][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:40:31,433][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:40:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:40:32,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:40:33,407][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:40:34,066][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:40:34,725][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:40:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:40:36,041][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:40:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:40:37,358][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:40:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:40:38,677][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:40:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:40:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:40:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:40:41,311][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:40:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:40:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:40:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:40:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:40:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:40:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:40:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:40:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:40:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:40:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:40:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:40:49,450][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:40:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:40:50,768][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:40:51,427][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:40:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:40:52,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:40:53,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:40:54,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:40:54,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:40:54,709][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:40:56,047][__main__][INFO] - Iteration 425 took 55s (15.96% Gen, 81.64% Train). Generation: 8s, Training: 45s. Estimated remaining time: 9h 11m 49s. Estimated total time: 15h 31m 21s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 40s. [2026-03-25 20:40:56,050][__main__][INFO] - Starting iteration 425. [2026-03-25 20:40:56,056][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:40:56,057][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:41:06,074][__main__][INFO] - Number of regex retries in iteration 425: 0 [2026-03-25 20:41:06,076][__main__][INFO] - agents played in iteration 425 are Alice, Bob [2026-03-25 20:41:06,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:06,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:06,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:41:06,684][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:41:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:41:07,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:41:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:41:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:41:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:41:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:41:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:41:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:41:12,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:41:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:41:13,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:41:14,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:41:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:41:15,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:41:16,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:41:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:41:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:41:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:41:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:41:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:41:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:41:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:41:21,786][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:41:22,444][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:41:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:41:23,762][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:41:24,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:41:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:41:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:41:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:41:27,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:41:27,715][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:41:28,375][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:41:29,033][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:41:29,692][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:41:30,351][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:41:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:41:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:41:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:41:32,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:41:33,645][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:41:34,303][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:41:34,963][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:41:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:41:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:41:36,939][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:41:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:41:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:41:39,215][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:41:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:41:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:41:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:41:41,848][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:41:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:41:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:41:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:41:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:41:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:41:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:41:46,459][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:41:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:41:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:41:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:41:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:41:49,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:41:50,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:41:51,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:41:51,534][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:41:51,535][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:41:52,724][__main__][INFO] - Iteration 426 took 56s (17.68% Gen, 80.22% Train). Generation: 10s, Training: 45s. Estimated remaining time: 9h 24m 2s. Estimated total time: 15h 44m 31s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 15s. [2026-03-25 20:41:52,726][__main__][INFO] - Starting iteration 426. [2026-03-25 20:41:52,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:41:52,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:41:58,169][__main__][INFO] - Number of regex retries in iteration 426: 0 [2026-03-25 20:41:58,171][__main__][INFO] - agents played in iteration 426 are Alice, Bob [2026-03-25 20:41:58,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:58,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:58,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:41:58,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:41:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:42:00,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:42:00,865][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:42:01,525][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:42:02,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:42:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:42:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:42:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:42:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:42:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:42:06,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:42:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:42:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:42:08,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:42:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:42:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:42:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:42:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:42:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:42:12,102][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:42:12,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:42:13,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:42:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:42:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:42:15,396][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:42:16,055][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:42:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:42:17,373][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:42:18,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:42:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:42:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:42:20,011][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:42:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:42:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:42:21,989][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:42:22,648][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:42:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:42:23,968][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:42:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:42:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:42:25,949][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:42:26,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:42:27,269][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:42:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:42:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:42:29,247][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:42:29,907][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:42:30,567][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:42:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:42:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:42:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:42:33,586][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:42:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:42:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:42:35,565][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:42:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:42:36,886][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:42:37,547][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:42:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:42:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:42:39,526][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:42:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:42:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:42:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:42:42,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:42:42,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:42:44,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:42:44,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:42:44,169][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:42:45,674][__main__][INFO] - Iteration 427 took 52s (10.27% Gen, 86.88% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 21m 3s. Estimated total time: 14h 42m 25s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 14s, 500 more iterations: 7h 21m 12s. [2026-03-25 20:42:45,676][__main__][INFO] - Starting iteration 427. [2026-03-25 20:42:45,680][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:42:45,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:42:50,657][__main__][INFO] - Number of regex retries in iteration 427: 0 [2026-03-25 20:42:50,658][__main__][INFO] - agents played in iteration 427 are Alice, Bob [2026-03-25 20:42:51,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:42:51,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:42:51,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:42:51,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:42:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:42:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:42:53,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:42:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:42:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:42:55,191][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:42:55,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:42:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:42:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:42:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:42:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:42:59,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:42:59,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:43:00,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:43:01,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:43:01,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:43:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:43:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:43:03,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:43:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:43:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:43:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:43:06,413][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:43:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:43:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:43:08,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:43:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:43:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:43:10,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:43:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:43:11,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:43:12,348][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:43:13,007][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:43:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:43:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:43:14,986][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:43:15,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:43:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:43:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:43:17,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:43:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:43:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:43:19,604][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:43:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:43:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:43:21,582][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:43:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:43:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:43:23,887][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:43:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:43:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:43:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:43:26,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:43:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:43:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:43:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:43:29,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:43:29,826][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:43:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:43:31,149][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:43:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:43:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:43:33,129][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:43:33,789][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:43:34,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:43:35,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:43:36,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:43:36,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:43:36,454][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:43:37,850][__main__][INFO] - Iteration 428 took 52s (9.54% Gen, 87.78% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 7m 18s. Estimated total time: 14h 29m 32s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 46s. [2026-03-25 20:43:37,853][__main__][INFO] - Starting iteration 428. [2026-03-25 20:43:37,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:43:37,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:43:48,739][__main__][INFO] - Number of regex retries in iteration 428: 0 [2026-03-25 20:43:48,741][__main__][INFO] - agents played in iteration 428 are Alice, Bob [2026-03-25 20:43:49,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:49,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:49,293][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:43:49,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:43:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:43:50,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:43:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:43:51,943][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:43:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:43:53,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:43:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:43:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:43:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:43:55,907][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:43:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:43:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:43:57,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:43:58,548][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:43:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:43:59,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:44:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:44:01,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:44:01,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:44:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:44:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:44:03,825][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:44:04,486][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:44:05,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:44:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:44:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:44:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:44:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:44:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:44:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:44:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:44:10,418][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:44:11,077][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:44:11,736][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:44:12,395][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:44:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:44:13,717][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:44:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:44:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:44:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:44:16,356][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:44:17,015][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:44:17,674][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:44:18,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:44:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:44:19,654][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:44:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:44:20,974][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:44:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:44:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:44:23,320][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:44:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:44:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:44:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:44:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:44:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:44:27,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:44:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:44:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:44:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:44:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:44:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:44:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:44:31,901][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:44:32,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:44:33,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:44:34,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:44:34,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:44:34,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:44:35,828][__main__][INFO] - Iteration 429 took 57s (18.77% Gen, 79.02% Train). Generation: 10s, Training: 45s. Estimated remaining time: 9h 43m 1s. Estimated total time: 16h 6m 12s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 37s, 500 more iterations: 8h 3m 6s. [2026-03-25 20:44:35,831][__main__][INFO] - Starting iteration 429. [2026-03-25 20:44:35,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:44:35,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:44:40,685][__main__][INFO] - Number of regex retries in iteration 429: 0 [2026-03-25 20:44:40,686][__main__][INFO] - agents played in iteration 429 are Alice, Bob [2026-03-25 20:44:41,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:44:41,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:44:41,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:44:41,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:44:42,041][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:44:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:44:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:44:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:44:44,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:44:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:44:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:44:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:44:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:44:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:44:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:44:49,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:44:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:44:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:44:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:44:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:44:52,565][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:44:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:44:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:44:54,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:44:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:44:55,870][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:44:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:44:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:44:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:44:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:44:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:44:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:45:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:45:01,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:45:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:45:02,474][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:45:03,135][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:45:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:45:04,452][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:45:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:45:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:45:06,431][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:45:07,090][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:45:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:45:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:45:09,066][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:45:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:45:10,384][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:45:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:45:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:45:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:45:13,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:45:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:45:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:45:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:45:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:45:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:45:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:45:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:45:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:45:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:45:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:45:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:45:21,266][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:45:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:45:22,585][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:45:23,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:45:23,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:45:24,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:45:25,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:45:26,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:45:26,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:45:26,456][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:45:27,795][__main__][INFO] - Iteration 430 took 51s (9.34% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 8h 1m 58s. Estimated total time: 14h 26m 2s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 1s. [2026-03-25 20:45:27,798][__main__][INFO] - Starting iteration 430. [2026-03-25 20:45:27,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:45:27,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:45:32,936][__main__][INFO] - Number of regex retries in iteration 430: 0 [2026-03-25 20:45:32,937][__main__][INFO] - agents played in iteration 430 are Alice, Bob [2026-03-25 20:45:33,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:45:33,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:45:33,499][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:45:33,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:45:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:45:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:45:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:45:36,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:45:36,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:45:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:45:38,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:45:38,777][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:45:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:45:40,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:45:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:45:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:45:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:45:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:45:43,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:45:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:45:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:45:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:45:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:45:46,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:45:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:45:48,017][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:45:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:45:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:45:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:45:50,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:45:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:45:51,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:45:52,632][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:45:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:45:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:45:54,612][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:45:55,271][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:45:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:45:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:45:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:45:57,910][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:45:58,569][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:45:59,228][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:45:59,887][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:46:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:46:01,207][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:46:01,866][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:46:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:46:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:46:03,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:46:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:46:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:46:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:46:06,818][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:46:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:46:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:46:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:46:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:46:10,113][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:46:10,772][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:46:11,433][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:46:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:46:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:46:13,412][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:46:14,070][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:46:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:46:15,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:46:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:46:16,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:46:17,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:46:18,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:46:18,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:46:18,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:46:19,928][__main__][INFO] - Iteration 431 took 52s (9.85% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 3m 52s. Estimated total time: 14h 28m 48s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 24s. [2026-03-25 20:46:19,930][__main__][INFO] - Starting iteration 431. [2026-03-25 20:46:19,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:46:19,935][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:46:25,583][__main__][INFO] - Number of regex retries in iteration 431: 0 [2026-03-25 20:46:25,584][__main__][INFO] - agents played in iteration 431 are Alice, Bob [2026-03-25 20:46:26,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:46:26,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:46:26,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:46:26,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:46:26,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:46:27,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:46:28,262][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:46:28,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:46:29,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:46:30,236][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:46:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:46:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:46:32,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:46:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:46:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:46:34,188][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:46:34,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:46:35,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:46:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:46:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:46:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:46:38,145][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:46:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:46:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:46:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:46:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:46:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:46:42,106][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:46:42,765][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:46:43,426][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:46:44,087][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:46:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:46:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:46:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:46:46,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:46:47,386][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:46:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:46:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:46:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:46:50,025][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:46:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:46:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:46:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:46:52,669][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:46:53,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:46:53,990][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:46:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:46:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:46:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:46:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:46:57,290][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:46:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:46:58,941][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:46:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:47:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:47:00,918][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:47:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:47:02,236][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:47:02,897][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:47:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:47:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:47:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:47:05,534][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:47:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:47:06,854][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:47:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:47:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:47:08,834][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:47:09,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:47:10,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:47:11,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:47:11,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:47:12,241][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:47:13,534][__main__][INFO] - Iteration 432 took 53s (10.54% Gen, 87.04% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 27m 32s. Estimated total time: 14h 53m 22s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 20s, 500 more iterations: 7h 26m 41s. [2026-03-25 20:47:13,536][__main__][INFO] - Starting iteration 432. [2026-03-25 20:47:13,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:47:13,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:47:18,998][__main__][INFO] - Number of regex retries in iteration 432: 0 [2026-03-25 20:47:18,998][__main__][INFO] - agents played in iteration 432 are Alice, Bob [2026-03-25 20:47:19,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:47:19,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:47:19,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:47:19,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:47:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:47:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:47:21,641][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:47:22,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:47:22,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:47:23,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:47:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:47:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:47:25,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:47:26,261][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:47:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:47:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:47:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:47:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:47:29,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:47:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:47:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:47:31,546][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:47:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:47:32,865][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:47:33,524][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:47:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:47:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:47:35,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:47:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:47:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:47:37,481][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:47:38,141][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:47:38,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:47:39,461][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:47:40,120][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:47:40,780][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:47:41,439][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:47:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:47:42,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:47:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:47:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:47:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:47:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:47:46,055][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:47:46,715][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:47:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:47:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:47:48,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:47:49,352][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:47:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:47:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:47:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:47:52,323][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:47:52,982][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:47:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:47:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:47:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:47:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:47:56,281][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:47:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:47:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:47:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:47:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:47:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:48:00,234][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:48:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:48:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:48:02,213][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:48:02,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:48:03,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:48:04,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:48:04,797][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:48:04,798][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:48:06,112][__main__][INFO] - Iteration 433 took 52s (10.38% Gen, 87.12% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 9m 31s. Estimated total time: 14h 36m 13s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 6s. [2026-03-25 20:48:06,115][__main__][INFO] - Starting iteration 433. [2026-03-25 20:48:06,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:48:06,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:48:14,270][__main__][INFO] - Number of regex retries in iteration 433: 0 [2026-03-25 20:48:14,271][__main__][INFO] - agents played in iteration 433 are Alice, Bob [2026-03-25 20:48:14,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:48:14,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:48:14,936][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:48:14,936][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:48:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:48:16,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:48:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:48:17,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:48:18,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:48:19,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:48:19,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:48:20,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:48:20,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:48:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:48:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:48:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:48:23,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:48:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:48:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:48:25,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:48:26,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:48:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:48:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:48:28,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:48:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:48:29,556][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:48:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:48:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:48:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:48:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:48:32,850][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:48:33,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:48:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:48:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:48:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:48:36,144][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:48:36,804][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:48:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:48:38,124][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:48:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:48:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:48:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:48:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:48:41,422][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:48:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:48:42,741][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:48:43,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:48:44,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:48:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:48:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:48:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:48:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:48:47,698][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:48:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:48:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:48:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:48:50,337][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:48:50,997][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:48:51,656][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:48:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:48:52,973][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:48:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:48:54,292][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:48:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:48:55,612][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:48:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:48:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:48:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:48:58,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:48:59,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:49:00,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:49:00,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:49:00,251][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:49:01,513][__main__][INFO] - Iteration 434 took 55s (14.71% Gen, 83.00% Train). Generation: 8s, Training: 45s. Estimated remaining time: 8h 55m 37s. Estimated total time: 15h 23m 14s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 37s. [2026-03-25 20:49:01,515][__main__][INFO] - Starting iteration 434. [2026-03-25 20:49:01,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:49:01,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:49:06,311][__main__][INFO] - Number of regex retries in iteration 434: 0 [2026-03-25 20:49:06,312][__main__][INFO] - agents played in iteration 434 are Alice, Bob [2026-03-25 20:49:06,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:06,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:06,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:49:06,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:49:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:49:08,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:49:08,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:49:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:49:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:49:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:49:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:49:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:49:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:49:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:49:14,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:49:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:49:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:49:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:49:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:49:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:49:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:49:18,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:49:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:49:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:49:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:49:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:49:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:49:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:49:23,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:49:24,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:49:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:49:25,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:49:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:49:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:49:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:49:28,078][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:49:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:49:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:49:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:49:30,720][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:49:31,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:49:32,041][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:49:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:49:33,363][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:49:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:49:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:49:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:49:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:49:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:49:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:49:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:49:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:49:39,652][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:49:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:49:40,973][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:49:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:49:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:49:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:49:43,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:49:44,271][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:49:44,931][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:49:45,590][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:49:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:49:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:49:47,572][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:49:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:49:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:49:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:49:50,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:49:51,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:49:52,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:49:52,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:49:52,175][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:49:53,505][__main__][INFO] - Iteration 435 took 51s (9.22% Gen, 88.22% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 57m 58s. Estimated total time: 14h 26m 28s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 14s. [2026-03-25 20:49:53,508][__main__][INFO] - Starting iteration 435. [2026-03-25 20:49:53,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:49:53,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:49:58,448][__main__][INFO] - Number of regex retries in iteration 435: 0 [2026-03-25 20:49:58,449][__main__][INFO] - agents played in iteration 435 are Alice, Bob [2026-03-25 20:49:58,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:59,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:59,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:49:59,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:49:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:50:00,365][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:50:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:50:01,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:50:02,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:50:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:50:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:50:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:50:04,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:50:05,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:50:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:50:06,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:50:07,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:50:08,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:50:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:50:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:50:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:50:10,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:50:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:50:12,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:50:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:50:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:50:14,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:50:14,882][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:50:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:50:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:50:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:50:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:50:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:50:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:50:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:50:20,155][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:50:20,813][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:50:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:50:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:50:22,790][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:50:23,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:50:24,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:50:24,770][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:50:25,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:50:26,092][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:50:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:50:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:50:28,071][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:50:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:50:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:50:30,051][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:50:30,709][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:50:31,714][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:50:32,374][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:50:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:50:33,693][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:50:34,351][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:50:35,010][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:50:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:50:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:50:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:50:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:50:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:50:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:50:39,621][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:50:40,280][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:50:40,938][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:50:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:50:42,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:50:43,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:50:44,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:50:44,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:50:44,325][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:50:45,596][__main__][INFO] - Iteration 436 took 52s (9.48% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 58m 45s. Estimated total time: 14h 28m 6s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 3s. [2026-03-25 20:50:45,599][__main__][INFO] - Starting iteration 436. [2026-03-25 20:50:45,603][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:50:45,604][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:50:50,596][__main__][INFO] - Number of regex retries in iteration 436: 0 [2026-03-25 20:50:50,597][__main__][INFO] - agents played in iteration 436 are Alice, Bob [2026-03-25 20:50:51,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:50:51,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:50:51,269][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:50:51,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:50:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:50:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:50:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:50:53,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:50:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:50:55,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:50:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:50:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:50:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:50:57,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:50:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:50:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:50:59,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:51:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:51:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:51:01,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:51:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:51:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:51:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:51:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:51:05,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:51:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:51:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:51:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:51:07,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:51:08,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:51:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:51:09,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:51:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:51:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:51:11,692][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:51:12,351][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:51:13,010][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:51:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:51:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:51:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:51:15,647][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:51:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:51:16,964][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:51:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:51:18,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:51:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:51:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:51:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:51:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:51:21,586][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:51:22,246][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:51:22,904][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:51:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:51:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:51:25,223][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:51:25,883][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:51:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:51:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:51:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:51:28,517][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:51:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:51:29,839][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:51:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:51:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:51:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:51:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:51:33,135][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:51:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:51:34,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:51:35,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:51:36,444][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:51:36,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:51:36,448][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:51:37,810][__main__][INFO] - Iteration 437 took 52s (9.56% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 59m 55s. Estimated total time: 14h 30m 9s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 4s. [2026-03-25 20:51:37,812][__main__][INFO] - Starting iteration 437. [2026-03-25 20:51:37,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:51:37,816][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:51:42,912][__main__][INFO] - Number of regex retries in iteration 437: 0 [2026-03-25 20:51:42,913][__main__][INFO] - agents played in iteration 437 are Alice, Bob [2026-03-25 20:51:43,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:51:43,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:51:43,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:51:43,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:51:44,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:51:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:51:45,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:51:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:51:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:51:47,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:51:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:51:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:51:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:51:50,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:51:50,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:51:51,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:51:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:51:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:51:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:51:54,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:51:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:51:55,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:51:56,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:51:56,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:51:57,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:51:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:51:58,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:51:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:52:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:52:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:52:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:52:02,097][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:52:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:52:03,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:52:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:52:04,730][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:52:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:52:06,048][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:52:06,706][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:52:07,365][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:52:08,026][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:52:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:52:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:52:10,003][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:52:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:52:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:52:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:52:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:52:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:52:13,954][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:52:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:52:15,271][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:52:16,263][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:52:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:52:17,578][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:52:18,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:52:18,895][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:52:19,554][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:52:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:52:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:52:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:52:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:52:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:52:23,506][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:52:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:52:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:52:25,486][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:52:26,148][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:52:26,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:52:27,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:52:28,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:52:28,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:52:28,697][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:52:29,976][__main__][INFO] - Iteration 438 took 52s (9.77% Gen, 87.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 58m 15s. Estimated total time: 14h 29m 21s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 40s. [2026-03-25 20:52:29,978][__main__][INFO] - Starting iteration 438. [2026-03-25 20:52:29,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:52:29,983][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:52:35,910][__main__][INFO] - Number of regex retries in iteration 438: 0 [2026-03-25 20:52:35,911][__main__][INFO] - agents played in iteration 438 are Alice, Bob [2026-03-25 20:52:36,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:52:36,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:52:36,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:52:36,502][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:52:37,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:52:37,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:52:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:52:39,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:52:39,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:52:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:52:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:52:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:52:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:52:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:52:43,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:52:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:52:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:52:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:52:46,376][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:52:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:52:47,694][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:52:48,353][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:52:49,011][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:52:49,670][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:52:50,329][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:52:50,987][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:52:51,646][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:52:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:52:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:52:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:52:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:52:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:52:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:52:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:52:56,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:52:57,571][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:52:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:52:58,891][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:52:59,551][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:53:00,210][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:53:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:53:01,529][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:53:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:53:02,849][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:53:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:53:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:53:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:53:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:53:06,142][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:53:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:53:07,460][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:53:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:53:09,117][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:53:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:53:10,435][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:53:11,095][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:53:11,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:53:12,413][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:53:13,074][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:53:13,733][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:53:14,394][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:53:15,054][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:53:15,712][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:53:16,374][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:53:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:53:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:53:18,361][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:53:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:53:19,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:53:20,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:53:21,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:53:21,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:53:21,689][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:53:23,927][__main__][INFO] - Iteration 439 took 53s (10.99% Gen, 84.86% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 27m 6s. Estimated total time: 14h 59m 6s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 54s, 500 more iterations: 7h 29m 33s. [2026-03-25 20:53:23,930][__main__][INFO] - Starting iteration 439. [2026-03-25 20:53:23,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:53:23,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:53:29,122][__main__][INFO] - Number of regex retries in iteration 439: 0 [2026-03-25 20:53:29,123][__main__][INFO] - agents played in iteration 439 are Alice, Bob [2026-03-25 20:53:29,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:53:29,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:53:29,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:53:29,784][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:53:30,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:53:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:53:31,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:53:32,431][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:53:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:53:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:53:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:53:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:53:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:53:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:53:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:53:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:53:38,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:53:39,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:53:39,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:53:40,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:53:41,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:53:41,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:53:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:53:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:53:43,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:53:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:53:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:53:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:53:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:53:46,939][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:53:47,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:53:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:53:48,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:53:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:53:50,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:53:50,898][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:53:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:53:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:53:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:53:53,539][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:53:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:53:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:53:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:53:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:53:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:53:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:53:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:53:58,814][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:53:59,473][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:54:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:54:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:54:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:54:02,457][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:54:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:54:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:54:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:54:05,097][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:54:05,757][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:54:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:54:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:54:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:54:08,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:54:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:54:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:54:10,369][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:54:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:54:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:54:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:54:13,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:54:13,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:54:14,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:54:14,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:54:14,923][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:54:16,347][__main__][INFO] - Iteration 440 took 52s (9.90% Gen, 87.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 0m 43s. Estimated total time: 14h 33m 35s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 47s. [2026-03-25 20:54:16,353][__main__][INFO] - Starting iteration 440. [2026-03-25 20:54:16,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:54:16,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:54:18,808][mllm.models.large_language_model_local][WARNING] - Response `)))`)))`)))`)))`)))`)))`)))`)))`)))`)))`)))`)))`)))) did not match regex: (|), retry 1/1 [2026-03-25 20:54:21,892][__main__][INFO] - Number of regex retries in iteration 440: 1 [2026-03-25 20:54:21,894][__main__][INFO] - agents played in iteration 440 are Alice, Bob [2026-03-25 20:54:22,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:54:22,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:54:22,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:54:22,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 20:54:23,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:54:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:54:24,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:54:25,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:54:25,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:54:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:54:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:54:27,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:54:28,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:54:29,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:54:29,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 20:54:30,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:54:31,265][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:54:31,925][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:54:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:54:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:54:33,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:54:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:54:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:54:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:54:36,556][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 20:54:37,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:54:37,876][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:54:38,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:54:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:54:39,862][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:54:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:54:41,177][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:54:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:54:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:54:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:54:43,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 20:54:44,497][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:54:45,156][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:54:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:54:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:54:47,138][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:54:47,799][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:54:48,461][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:54:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:54:49,781][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
20:54:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:54:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:54:51,764][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:54:52,425][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:54:53,087][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:54:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:54:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:54:55,423][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:54:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:54:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:54:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 20:54:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:54:58,734][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:54:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:55:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:55:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:55:01,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:55:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:55:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:55:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:55:04,016][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 20:55:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:55:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:55:05,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 20:55:06,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:55:08,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:55:08,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:55:08,049][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:55:09,289][__main__][INFO] - Iteration 441 took 52s (10.46% Gen, 87.19% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 8m 28s. Estimated total time: 14h 42m 13s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 13s, 500 more iterations: 7h 21m 6s. [2026-03-25 20:55:09,292][__main__][INFO] - Starting iteration 441. [2026-03-25 20:55:09,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:55:09,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:55:14,425][__main__][INFO] - Number of regex retries in iteration 441: 0 [2026-03-25 20:55:14,426][__main__][INFO] - agents played in iteration 441 are Alice, Bob [2026-03-25 20:55:14,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:14,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:14,983][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:55:14,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:55:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:55:16,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:55:17,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:55:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:55:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:55:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:55:19,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:55:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:55:20,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:55:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:55:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:55:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:55:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:55:24,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:55:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:55:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:55:26,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:55:26,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:55:27,576][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:55:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:55:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:55:29,553][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:55:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:55:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:55:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:55:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:55:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:55:33,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:55:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:55:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:55:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:55:36,152][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:55:36,811][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:55:37,469][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:55:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:55:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:55:39,447][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:55:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:55:40,766][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:55:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:55:42,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:55:42,741][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:55:43,399][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:55:44,058][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:55:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:55:45,376][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:55:46,034][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:55:46,693][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:55:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:55:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:55:49,148][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:55:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:55:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:55:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:55:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:55:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:55:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:55:53,763][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:55:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:55:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:55:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:55:56,401][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:55:57,060][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:55:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:55:58,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:55:59,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:56:00,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:56:00,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:56:00,562][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:56:02,110][__main__][INFO] - Iteration 442 took 52s (9.71% Gen, 87.35% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 5m 38s. Estimated total time: 14h 40m 17s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 8s. [2026-03-25 20:56:02,113][__main__][INFO] - Starting iteration 442. [2026-03-25 20:56:02,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:56:02,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:56:08,751][__main__][INFO] - Number of regex retries in iteration 442: 0 [2026-03-25 20:56:08,752][__main__][INFO] - agents played in iteration 442 are Alice, Bob [2026-03-25 20:56:09,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:56:09,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:56:09,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:56:09,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:56:10,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:56:10,712][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:56:11,373][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:56:12,033][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:56:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:56:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:56:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:56:14,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:56:15,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:56:15,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:56:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:56:17,306][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:56:17,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:56:18,627][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:56:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:56:19,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:56:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:56:21,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:56:21,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:56:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:56:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:56:23,907][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:56:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:56:25,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:56:25,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:56:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:56:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:56:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:56:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:56:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:56:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:56:30,498][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:56:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:56:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:56:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:56:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:56:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:56:34,454][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:56:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:56:35,771][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:56:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:56:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:56:37,750][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:56:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:56:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:56:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:56:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:56:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:56:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:56:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:56:43,371][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:56:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:56:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:56:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:56:46,008][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:56:46,667][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:56:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:56:47,985][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:56:48,644][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:56:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:56:49,963][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:56:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:56:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:56:51,940][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:56:52,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:56:53,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:56:54,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:56:54,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:56:54,404][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:56:55,798][__main__][INFO] - Iteration 443 took 53s (12.36% Gen, 85.04% Train). Generation: 6s, Training: 45s. Estimated remaining time: 8h 19m 11s. Estimated total time: 14h 54m 43s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 28s, 500 more iterations: 7h 27m 21s. [2026-03-25 20:56:55,800][__main__][INFO] - Starting iteration 443. [2026-03-25 20:56:55,804][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:56:55,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:57:00,680][__main__][INFO] - Number of regex retries in iteration 443: 0 [2026-03-25 20:57:00,681][__main__][INFO] - agents played in iteration 443 are Alice, Bob [2026-03-25 20:57:01,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:01,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:01,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:57:01,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:57:02,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:57:02,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:57:03,321][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:57:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:57:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:57:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:57:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:57:06,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:57:07,282][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:57:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:57:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:57:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:57:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:57:10,584][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:57:11,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:57:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:57:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:57:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:57:13,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:57:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:57:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:57:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:57:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:57:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:57:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:57:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:57:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:57:19,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:57:20,477][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:57:21,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:57:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:57:22,460][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:57:23,119][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:57:23,778][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:57:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:57:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:57:25,756][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:57:26,417][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:57:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:57:27,735][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:57:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:57:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:57:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:57:30,370][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:57:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:57:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:57:32,348][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:57:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:57:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:57:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:57:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:57:35,974][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:57:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:57:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:57:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:57:38,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:57:39,271][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:57:39,930][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:57:40,589][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:57:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:57:41,906][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:57:42,564][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:57:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:57:43,881][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:57:44,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:57:45,301][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:57:46,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:57:46,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:57:46,497][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:57:47,810][__main__][INFO] - Iteration 444 took 52s (9.38% Gen, 88.09% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 50m 24s. Estimated total time: 14h 26m 48s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 24s. [2026-03-25 20:57:47,816][__main__][INFO] - Starting iteration 444. [2026-03-25 20:57:47,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:57:47,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:57:56,777][__main__][INFO] - Number of regex retries in iteration 444: 0 [2026-03-25 20:57:56,778][__main__][INFO] - agents played in iteration 444 are Alice, Bob [2026-03-25 20:57:57,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:57,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:57,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:57:57,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:57:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:57:58,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:57:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:58:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:58:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:58:01,469][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:58:02,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:58:02,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:58:03,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:58:04,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:58:04,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:58:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:58:06,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:58:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:58:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:58:08,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:58:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:58:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:58:10,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:58:10,721][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:58:11,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:58:12,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:58:12,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:58:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:58:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:58:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:58:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:58:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:58:16,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:58:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:58:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:58:18,646][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:58:19,304][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:58:19,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:58:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:58:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:58:21,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:58:22,596][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:58:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:58:23,912][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:58:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:58:25,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:58:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:58:26,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:58:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:58:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:58:28,523][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:58:29,181][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:58:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:58:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:58:31,494][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:58:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:58:32,809][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:58:33,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:58:34,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:58:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:58:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:58:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:58:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:58:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:58:38,072][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:58:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:58:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:58:40,048][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:58:40,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:58:41,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:58:42,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:58:42,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:58:42,632][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:58:43,987][__main__][INFO] - Iteration 445 took 56s (15.95% Gen, 81.63% Train). Generation: 8s, Training: 45s. Estimated remaining time: 8h 58m 49s. Estimated total time: 15h 36m 9s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 4s. [2026-03-25 20:58:43,991][__main__][INFO] - Starting iteration 445. [2026-03-25 20:58:43,997][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:58:43,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:59:01,314][__main__][INFO] - Number of regex retries in iteration 445: 0 [2026-03-25 20:59:01,316][__main__][INFO] - agents played in iteration 445 are Alice, Bob [2026-03-25 20:59:01,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:02,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:02,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:59:02,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:59:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:59:03,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:59:03,963][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:59:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:59:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:59:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:59:06,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:59:07,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:59:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:59:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:59:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:59:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:59:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:59:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:59:11,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:59:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:59:13,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:59:13,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:59:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:59:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:59:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:59:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:59:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:59:17,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:59:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:59:19,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:59:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:59:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:59:21,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:59:21,719][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:59:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:59:23,036][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:59:23,694][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:59:24,352][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:59:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:59:25,668][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:59:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:59:26,983][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:59:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:59:28,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:59:28,957][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:59:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:59:30,275][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:59:30,933][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:59:31,592][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:59:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:59:32,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:59:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:59:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:59:35,243][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:59:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:59:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:59:37,228][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:59:37,889][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:59:38,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:59:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:59:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:59:40,541][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:59:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:59:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:59:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:59:43,188][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:59:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:59:44,511][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:59:45,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:59:45,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:59:47,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:59:47,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:59:47,140][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:59:48,454][__main__][INFO] - Iteration 446 took 1m 4s (26.87% Gen, 71.09% Train). Generation: 17s, Training: 45s. Estimated remaining time: 11h 15m 54s. Estimated total time: 17h 54m 18s. Time estimates for 10 more iterations: 10m 44s, 100 more iterations: 1h 47m 25s, 500 more iterations: 8h 57m 9s. [2026-03-25 20:59:48,457][__main__][INFO] - Starting iteration 446. [2026-03-25 20:59:48,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:59:48,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:59:53,311][__main__][INFO] - Number of regex retries in iteration 446: 0 [2026-03-25 20:59:53,312][__main__][INFO] - agents played in iteration 446 are Alice, Bob [2026-03-25 20:59:53,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:53,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:53,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:59:53,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:59:54,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:59:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:59:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:59:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:59:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:59:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:59:58,619][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:59:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:59:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:00:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:00:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:00:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:00:02,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:00:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:00:03,898][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:00:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:00:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:00:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:00:06,530][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:00:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:00:07,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:00:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:00:09,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:00:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:00:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:00:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:00:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:00:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:00:13,112][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:00:13,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:00:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:00:15,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:00:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:00:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:00:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:00:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:00:18,382][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:00:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:00:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:00:20,355][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:00:21,013][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:00:21,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:00:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:00:22,992][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:00:23,651][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:00:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:00:24,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:00:25,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:00:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:00:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:00:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:00:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:00:29,253][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:00:29,911][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:00:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:00:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:00:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:00:32,551][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:00:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:00:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:00:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:00:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:00:35,870][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:00:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:00:37,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:00:38,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:00:39,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:00:39,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:00:39,332][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:00:40,527][__main__][INFO] - Iteration 447 took 52s (9.32% Gen, 88.38% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 48m 31s. Estimated total time: 14h 27m 47s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 53s. [2026-03-25 21:00:42,640][__main__][INFO] - Starting iteration 447. [2026-03-25 21:00:42,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:00:42,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:00:47,830][__main__][INFO] - Number of regex retries in iteration 447: 0 [2026-03-25 21:00:47,832][__main__][INFO] - agents played in iteration 447 are Alice, Bob [2026-03-25 21:00:48,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:00:48,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:00:48,720][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:00:48,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:00:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:00:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:00:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:00:51,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:00:51,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:00:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:00:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:00:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:00:54,619][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:00:55,277][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:00:55,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:00:56,590][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:00:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:00:57,904][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:00:58,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:00:59,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:00:59,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:01:00,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:01:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:01:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:01:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:01:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:01:03,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:01:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:01:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:01:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:01:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:01:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:01:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:01:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:01:09,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:01:09,730][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:01:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:01:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:01:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:01:12,359][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:01:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:01:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:01:14,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:01:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:01:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:01:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:01:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:01:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:01:18,311][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:01:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:01:19,636][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:01:20,298][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:01:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:01:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:01:22,609][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:01:23,274][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:01:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:01:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:01:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:01:25,924][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:01:26,585][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:01:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:01:27,899][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:01:28,556][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:01:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:01:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:01:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:01:31,185][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:01:31,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:01:32,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:01:33,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:01:33,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:01:33,822][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:01:35,102][__main__][INFO] - Iteration 448 took 52s (9.88% Gen, 87.67% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 54m 8s. Estimated total time: 14h 34m 19s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 9s. [2026-03-25 21:01:35,105][__main__][INFO] - Starting iteration 448. [2026-03-25 21:01:35,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:01:35,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:01:40,249][__main__][INFO] - Number of regex retries in iteration 448: 0 [2026-03-25 21:01:40,251][__main__][INFO] - agents played in iteration 448 are Alice, Bob [2026-03-25 21:01:41,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:01:41,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:01:41,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:01:41,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:01:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:01:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:01:43,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:01:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:01:44,715][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:01:45,371][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:01:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:01:46,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:01:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:01:47,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:01:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:01:49,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:01:49,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:01:50,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:01:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:01:51,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:01:52,596][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:01:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:01:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:01:54,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:01:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:01:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:01:56,544][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:01:57,207][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:01:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:01:58,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:01:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:01:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:02:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:02:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:02:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:02:02,495][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:02:03,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:02:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:02:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:02:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:02:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:02:06,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:02:07,122][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:02:07,786][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:02:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:02:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:02:09,760][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:02:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:02:11,074][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:02:11,731][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:02:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:02:13,045][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:02:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:02:14,696][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:02:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:02:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:02:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:02:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:02:17,981][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:02:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:02:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:02:19,952][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:02:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:02:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:02:21,924][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:02:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:02:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:02:23,895][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:02:24,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:02:25,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:02:26,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:02:26,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:02:26,586][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:02:29,570][__main__][INFO] - Iteration 449 took 54s (9.44% Gen, 85.08% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 26m 37s. Estimated total time: 15h 7m 42s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 46s, 500 more iterations: 7h 33m 51s. [2026-03-25 21:02:29,574][__main__][INFO] - Starting iteration 449. [2026-03-25 21:02:29,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:02:29,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:02:34,597][__main__][INFO] - Number of regex retries in iteration 449: 0 [2026-03-25 21:02:34,598][__main__][INFO] - agents played in iteration 449 are Alice, Bob [2026-03-25 21:02:35,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:02:35,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:02:35,565][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:02:35,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:02:36,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:02:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:02:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:02:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:02:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:02:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:02:40,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:02:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:02:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:02:42,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:02:42,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:02:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:02:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:02:44,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:02:45,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:02:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:02:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:02:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:02:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:02:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:02:49,349][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:02:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:02:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:02:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:02:51,976][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:02:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:02:53,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:02:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:02:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:02:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:02:55,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:02:56,574][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:02:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:02:57,887][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:02:58,544][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:02:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:02:59,857][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:03:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:03:01,171][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:03:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:03:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:03:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:03:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:03:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:03:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:03:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:03:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:03:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:03:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:03:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:03:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:03:10,044][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:03:10,701][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:03:11,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:03:12,015][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:03:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:03:13,328][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:03:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:03:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:03:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:03:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:03:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:03:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:03:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:03:18,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:03:19,313][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:03:20,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:03:20,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:03:20,544][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:03:27,151][__main__][INFO] - Iteration 450 took 57s (8.72% Gen, 79.80% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 17m 32s. Estimated total time: 15h 59m 35s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 57s, 500 more iterations: 7h 59m 47s. [2026-03-25 21:03:27,155][__main__][INFO] - Starting iteration 450. [2026-03-25 21:03:27,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:03:27,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:03:32,166][__main__][INFO] - Number of regex retries in iteration 450: 0 [2026-03-25 21:03:32,167][__main__][INFO] - agents played in iteration 450 are Alice, Bob [2026-03-25 21:03:32,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:03:33,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:03:33,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:03:33,048][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:03:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:03:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:03:35,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:03:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:03:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:03:36,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:03:37,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:03:38,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:03:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:03:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:03:40,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:03:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:03:41,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:03:42,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:03:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:03:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:03:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:03:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:03:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:03:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:03:46,856][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:03:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:03:48,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:03:48,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:03:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:03:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:03:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:03:51,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:03:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:03:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:03:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:03:54,086][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:03:54,743][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:03:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:03:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:03:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:03:57,371][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:03:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:03:58,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:03:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:03:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:04:00,656][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:04:01,315][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:04:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:04:02,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:04:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:04:03,947][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:04:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:04:05,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:04:06,248][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:04:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:04:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:04:08,222][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:04:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:04:09,538][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:04:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:04:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:04:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:04:12,171][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:04:12,831][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:04:13,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:04:14,149][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:04:14,807][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:04:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:04:16,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:04:16,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:04:18,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:04:18,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:04:18,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:04:20,851][__main__][INFO] - Iteration 451 took 53s (9.32% Gen, 85.40% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 11m 55s. Estimated total time: 14h 54m 52s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 29s, 500 more iterations: 7h 27m 26s. [2026-03-25 21:04:20,853][__main__][INFO] - Starting iteration 451. [2026-03-25 21:04:20,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:04:20,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:04:25,901][__main__][INFO] - Number of regex retries in iteration 451: 0 [2026-03-25 21:04:25,903][__main__][INFO] - agents played in iteration 451 are Alice, Bob [2026-03-25 21:04:26,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:04:26,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:04:26,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:04:26,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:04:27,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:04:28,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:04:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:04:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:04:30,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:04:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:04:31,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:04:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:04:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:04:33,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:04:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:04:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:04:35,412][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:04:36,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:04:36,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:04:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:04:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:04:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:04:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:04:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:04:40,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:04:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:04:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:04:42,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:04:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:04:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:04:44,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:04:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:04:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:04:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:04:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:04:47,893][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:04:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:04:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:04:49,865][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:04:50,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:04:51,178][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:04:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:04:52,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:04:53,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:04:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:04:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:04:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:04:55,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:04:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:04:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:04:57,747][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:04:58,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:04:59,400][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:05:00,057][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:05:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:05:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:05:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:05:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:05:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:05:03,998][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:05:04,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:05:05,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:05:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:05:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:05:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:05:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:05:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:05:09,255][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:05:09,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:05:10,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:05:11,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:05:11,724][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:05:11,726][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:05:13,177][__main__][INFO] - Iteration 452 took 52s (9.64% Gen, 87.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 48m 12s. Estimated total time: 14h 32m 2s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 1s. [2026-03-25 21:05:13,180][__main__][INFO] - Starting iteration 452. [2026-03-25 21:05:13,184][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:05:13,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:05:18,011][__main__][INFO] - Number of regex retries in iteration 452: 0 [2026-03-25 21:05:18,012][__main__][INFO] - agents played in iteration 452 are Alice, Bob [2026-03-25 21:05:18,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:05:18,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:05:18,756][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:05:18,757][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:05:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:05:20,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:05:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:05:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:05:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:05:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:05:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:05:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:05:24,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:05:25,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:05:25,978][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:05:26,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:05:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:05:27,950][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:05:28,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:05:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:05:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:05:30,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:05:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:05:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:05:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:05:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:05:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:05:34,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:05:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:05:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:05:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:05:37,152][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:05:37,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:05:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:05:39,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:05:39,781][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:05:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:05:41,095][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:05:41,752][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:05:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:05:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:05:43,723][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:05:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:05:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:05:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:05:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:05:47,008][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:05:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:05:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:05:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:05:49,636][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:05:50,293][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:05:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:05:51,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:05:52,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:05:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:05:53,868][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:05:54,526][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:05:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:05:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:05:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:05:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:05:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:05:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:05:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:05:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:06:00,441][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:06:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:06:01,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:06:02,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:06:03,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:06:03,668][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:06:03,669][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:06:04,998][__main__][INFO] - Iteration 453 took 51s (9.32% Gen, 88.11% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 38m 55s. Estimated total time: 14h 23m 36s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 48s. [2026-03-25 21:06:05,002][__main__][INFO] - Starting iteration 453. [2026-03-25 21:06:05,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:06:05,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:06:09,855][__main__][INFO] - Number of regex retries in iteration 453: 0 [2026-03-25 21:06:09,856][__main__][INFO] - agents played in iteration 453 are Alice, Bob [2026-03-25 21:06:10,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:06:10,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:06:10,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:06:10,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:06:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:06:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:06:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:06:13,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:06:13,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:06:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:06:15,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:06:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:06:16,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:06:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:06:17,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:06:18,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:06:19,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:06:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:06:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:06:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:06:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:06:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:06:23,014][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:06:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:06:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:06:24,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:06:25,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:06:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:06:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:06:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:06:28,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:06:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:06:29,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:06:30,243][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:06:30,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:06:31,556][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:06:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:06:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:06:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:06:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:06:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:06:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:06:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:06:36,814][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:06:37,471][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:06:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:06:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:06:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:06:40,100][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:06:40,757][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:06:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:06:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:06:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:06:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:06:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:06:44,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:06:45,600][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:06:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:06:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:06:47,572][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:06:48,230][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:06:48,887][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:06:49,544][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:06:50,202][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:06:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:06:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:06:52,174][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:06:52,831][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:06:53,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:06:54,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:06:55,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:06:55,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:06:55,547][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:06:56,770][__main__][INFO] - Iteration 454 took 51s (9.37% Gen, 88.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 37m 13s. Estimated total time: 14h 22m 46s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 23s. [2026-03-25 21:06:56,772][__main__][INFO] - Starting iteration 454. [2026-03-25 21:06:56,776][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:06:56,777][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:07:02,033][__main__][INFO] - Number of regex retries in iteration 454: 0 [2026-03-25 21:07:02,034][__main__][INFO] - agents played in iteration 454 are Alice, Bob [2026-03-25 21:07:03,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:03,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:03,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:07:03,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:07:04,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:07:05,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:07:05,670][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:07:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:07:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:07:07,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:07:08,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:07:08,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:07:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:07:10,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:07:10,926][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:07:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:07:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:07:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:07:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:07:14,212][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:07:14,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:07:15,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:07:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:07:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:07:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:07:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:07:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:07:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:07:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:07:20,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:07:21,440][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:07:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:07:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:07:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:07:24,068][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:07:24,725][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:07:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:07:26,040][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:07:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:07:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:07:28,013][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:07:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:07:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:07:29,984][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:07:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:07:31,298][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:07:31,955][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:07:32,613][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:07:33,270][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:07:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:07:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:07:35,540][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:07:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:07:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:07:37,513][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:07:38,171][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:07:38,828][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:07:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:07:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:07:40,800][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:07:41,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:07:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:07:42,771][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:07:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:07:44,086][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:07:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:07:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:07:46,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:07:46,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:07:47,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:07:47,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:07:47,997][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:07:49,251][__main__][INFO] - Iteration 455 took 52s (10.02% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 48m 10s. Estimated total time: 14h 34m 35s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 17s. [2026-03-25 21:07:49,254][__main__][INFO] - Starting iteration 455. [2026-03-25 21:07:49,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:07:49,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:07:54,389][__main__][INFO] - Number of regex retries in iteration 455: 0 [2026-03-25 21:07:54,391][__main__][INFO] - agents played in iteration 455 are Alice, Bob [2026-03-25 21:07:55,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:55,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:55,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:07:55,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:07:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:07:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:07:57,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:07:58,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:07:59,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:07:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:08:00,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:08:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:08:01,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:08:02,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:08:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:08:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:08:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:08:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:08:05,837][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:08:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:08:07,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:08:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:08:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:08:09,128][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:08:09,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:08:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:08:11,102][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:08:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:08:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:08:13,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:08:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:08:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:08:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:08:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:08:16,357][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:08:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:08:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:08:18,328][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:08:18,986][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:08:19,644][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:08:20,301][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:08:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:08:21,618][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:08:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:08:22,931][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:08:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:08:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:08:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:08:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:08:26,215][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:08:26,871][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:08:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:08:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:08:29,175][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:08:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:08:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:08:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:08:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:08:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:08:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:08:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:08:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:08:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:08:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:08:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:08:37,059][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:08:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:08:38,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:08:39,177][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:08:40,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:08:40,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:08:40,304][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:08:41,585][__main__][INFO] - Iteration 456 took 52s (9.81% Gen, 87.74% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 44m 50s. Estimated total time: 14h 32m 7s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 3s. [2026-03-25 21:08:41,588][__main__][INFO] - Starting iteration 456. [2026-03-25 21:08:41,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:08:41,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:08:46,660][__main__][INFO] - Number of regex retries in iteration 456: 0 [2026-03-25 21:08:46,661][__main__][INFO] - agents played in iteration 456 are Alice, Bob [2026-03-25 21:08:47,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:08:47,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:08:47,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:08:47,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:08:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:08:48,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:08:49,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:08:50,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:08:50,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:08:51,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:08:52,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:08:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:08:53,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:08:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:08:54,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:08:55,497][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:08:56,156][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:08:56,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:08:57,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:08:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:08:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:08:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:09:00,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:09:00,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:09:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:09:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:09:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:09:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:09:04,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:09:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:09:05,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:09:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:09:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:09:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:09:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:09:08,646][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:09:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:09:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:09:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:09:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:09:11,939][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:09:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:09:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:09:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:09:14,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:09:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:09:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:09:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:09:17,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:09:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:09:18,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:09:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:09:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:09:20,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:09:21,488][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:09:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:09:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:09:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:09:24,123][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:09:24,782][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:09:25,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:09:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:09:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:09:27,418][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:09:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:09:28,736][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:09:29,394][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:09:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:09:30,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:09:31,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:09:32,752][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:09:32,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:09:32,757][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:09:33,964][__main__][INFO] - Iteration 457 took 52s (9.68% Gen, 88.01% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 44m 44s. Estimated total time: 14h 32m 54s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 27s. [2026-03-25 21:09:33,967][__main__][INFO] - Starting iteration 457. [2026-03-25 21:09:33,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:09:33,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:09:39,005][__main__][INFO] - Number of regex retries in iteration 457: 0 [2026-03-25 21:09:39,007][__main__][INFO] - agents played in iteration 457 are Alice, Bob [2026-03-25 21:09:39,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:09:39,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:09:39,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:09:39,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:09:40,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:09:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:09:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:09:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:09:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:09:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:09:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:09:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:09:45,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:09:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:09:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:09:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:09:48,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:09:49,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:09:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:09:50,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:09:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:09:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:09:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:09:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:09:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:09:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:09:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:09:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:09:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:09:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:09:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:09:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:09:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:09:59,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:10:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:10:00,935][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:10:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:10:02,249][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:10:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:10:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:10:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:10:04,888][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:10:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:10:06,206][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:10:06,865][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:10:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:10:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:10:08,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:10:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:10:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:10:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:10:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:10:12,796][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:10:13,459][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:10:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:10:14,774][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:10:15,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:10:16,094][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:10:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:10:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:10:18,070][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:10:18,733][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:10:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:10:20,048][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:10:20,707][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:10:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:10:22,020][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:10:22,676][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:10:23,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:10:24,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:10:25,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:10:25,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:10:25,593][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:10:26,877][__main__][INFO] - Iteration 458 took 52s (9.52% Gen, 88.05% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 52m 44s. Estimated total time: 14h 41m 47s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 53s. [2026-03-25 21:10:26,880][__main__][INFO] - Starting iteration 458. [2026-03-25 21:10:26,883][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:10:26,884][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:10:31,880][__main__][INFO] - Number of regex retries in iteration 458: 0 [2026-03-25 21:10:31,881][__main__][INFO] - agents played in iteration 458 are Alice, Bob [2026-03-25 21:10:32,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:10:32,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:10:32,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:10:32,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:10:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:10:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:10:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:10:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:10:36,055][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:10:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:10:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:10:38,024][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:10:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:10:39,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:10:39,996][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:10:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:10:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:10:41,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:10:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:10:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:10:43,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:10:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:10:45,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:10:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:10:46,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:10:47,223][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:10:47,881][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:10:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:10:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:10:49,855][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:10:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:10:51,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:10:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:10:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:10:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:10:53,799][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:10:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:10:55,114][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:10:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:10:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:10:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:10:57,744][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:10:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:10:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:10:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:11:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:11:01,030][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:11:01,687][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:11:02,344][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:11:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:11:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:11:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:11:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:11:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:11:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:11:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:11:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:11:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:11:09,355][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:11:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:11:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:11:11,326][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:11:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:11:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:11:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:11:13,955][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:11:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:11:15,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:11:15,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:11:16,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:11:18,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:11:18,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:11:18,083][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:11:19,258][__main__][INFO] - Iteration 459 took 52s (9.54% Gen, 88.21% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 43m 1s. Estimated total time: 14h 32m 56s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 28s. [2026-03-25 21:11:19,261][__main__][INFO] - Starting iteration 459. [2026-03-25 21:11:19,265][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:11:19,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:11:24,216][__main__][INFO] - Number of regex retries in iteration 459: 0 [2026-03-25 21:11:24,217][__main__][INFO] - agents played in iteration 459 are Alice, Bob [2026-03-25 21:11:25,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:11:25,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:11:25,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:11:25,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:11:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:11:26,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:11:27,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:11:27,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:11:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:11:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:11:29,924][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:11:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:11:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:11:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:11:32,551][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:11:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:11:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:11:34,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:11:35,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:11:35,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:11:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:11:37,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:11:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:11:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:11:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:11:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:11:40,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:11:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:11:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:11:42,423][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:11:43,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:11:43,737][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:11:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:11:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:11:45,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:11:46,367][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:11:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:11:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:11:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:11:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:11:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:11:50,311][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:11:50,968][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:11:51,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:11:52,284][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:11:52,940][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:11:53,598][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:11:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:11:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:11:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:11:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:11:56,889][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:11:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:11:58,539][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:11:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:11:59,855][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:12:00,512][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:12:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:12:01,827][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:12:02,483][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:12:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:12:03,797][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:12:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:12:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:12:05,770][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:12:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:12:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:12:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:12:08,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:12:09,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:12:10,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:12:10,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:12:10,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:12:11,775][__main__][INFO] - Iteration 460 took 52s (9.43% Gen, 87.99% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 44m 23s. Estimated total time: 14h 35m 11s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 31s, 500 more iterations: 7h 17m 35s. [2026-03-25 21:12:11,777][__main__][INFO] - Starting iteration 460. [2026-03-25 21:12:11,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:12:11,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:12:16,772][__main__][INFO] - Number of regex retries in iteration 460: 0 [2026-03-25 21:12:16,774][__main__][INFO] - agents played in iteration 460 are Alice, Bob [2026-03-25 21:12:17,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:12:17,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:12:17,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:12:17,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:12:18,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:12:19,043][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:12:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:12:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:12:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:12:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:12:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:12:22,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:12:23,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:12:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:12:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:12:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:12:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:12:26,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:12:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:12:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:12:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:12:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:12:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:12:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:12:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:12:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:12:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:12:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:12:34,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:12:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:12:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:12:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:12:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:12:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:12:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:12:38,755][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:12:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:12:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:12:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:12:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:12:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:12:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:12:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:12:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:12:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:12:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:12:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:12:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:12:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:12:47,952][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:12:48,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:12:49,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:12:50,258][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:12:50,915][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:12:51,572][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:12:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:12:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:12:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:12:54,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:12:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:12:55,513][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:12:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:12:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:12:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:12:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:12:58,799][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:12:59,456][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:13:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:13:00,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:13:01,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:13:02,673][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:13:02,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:13:02,677][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:13:04,184][__main__][INFO] - Iteration 461 took 52s (9.52% Gen, 87.60% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 41m 44s. Estimated total time: 14h 33m 24s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 42s. [2026-03-25 21:13:04,186][__main__][INFO] - Starting iteration 461. [2026-03-25 21:13:04,191][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:13:04,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:13:08,879][__main__][INFO] - Number of regex retries in iteration 461: 0 [2026-03-25 21:13:08,881][__main__][INFO] - agents played in iteration 461 are Alice, Bob [2026-03-25 21:13:09,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:13:09,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:13:09,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:13:09,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:13:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:13:11,071][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:13:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:13:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:13:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:13:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:13:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:13:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:13:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:13:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:13:16,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:13:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:13:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:13:18,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:13:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:13:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:13:20,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:13:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:13:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:13:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:13:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:13:24,214][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:13:24,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:13:25,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:13:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:13:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:13:27,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:13:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:13:28,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:13:29,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:13:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:13:30,790][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:13:31,447][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:13:32,106][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:13:32,763][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:13:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:13:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:13:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:13:35,390][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:13:36,047][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:13:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:13:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:13:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:13:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:13:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:13:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:13:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:13:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:13:42,287][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:13:42,944][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:13:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:13:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:13:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:13:45,571][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:13:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:13:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:13:47,543][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:13:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:13:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:13:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:13:50,170][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:13:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:13:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:13:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:13:52,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:13:53,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:13:54,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:13:54,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:13:54,801][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:13:56,252][__main__][INFO] - Iteration 462 took 52s (9.01% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 35m 11s. Estimated total time: 14h 27m 43s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 51s. [2026-03-25 21:13:56,255][__main__][INFO] - Starting iteration 462. [2026-03-25 21:13:56,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:13:56,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:14:01,170][__main__][INFO] - Number of regex retries in iteration 462: 0 [2026-03-25 21:14:01,172][__main__][INFO] - agents played in iteration 462 are Alice, Bob [2026-03-25 21:14:02,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:14:02,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:14:02,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:14:02,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:14:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:14:03,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:14:04,025][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:14:04,682][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:14:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:14:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:14:06,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:14:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:14:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:14:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:14:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:14:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:14:10,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:14:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:14:13,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:14:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:14:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:14:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:14:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:14:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:14:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:14:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:14:20,943][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:14:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:14:22,259][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:14:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:14:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:14:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:14:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:14:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:14:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:14:26,862][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:14:27,519][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:14:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:14:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:14:29,491][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:14:30,148][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:14:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:14:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:14:32,124][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:14:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:14:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:14:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:14:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:14:35,408][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:14:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:14:36,722][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:14:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:14:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:14:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:14:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:14:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:14:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:14:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:14:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:14:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:14:43,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:14:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:14:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:14:45,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:14:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:14:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:14:47,608][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:14:48,265][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:14:48,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:14:49,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 21:14:50,805][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:14:50,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:14:50,810][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:14:52,673][__main__][INFO] - Iteration 463 took 56s (8.71% Gen, 87.99% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 46m 46s. Estimated total time: 15h 40m 15s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 7s. [2026-03-25 21:14:52,677][__main__][INFO] - Starting iteration 463. [2026-03-25 21:14:52,693][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2026-03-25 21:14:52,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:14:54,056][mllm.models.large_language_model_local][WARNING] - Response Awaiting previous play of the other player for the first move. 
did not match regex: (|), retry 1/1 [2026-03-25 21:15:00,613][__main__][INFO] - Number of regex retries in iteration 463: 1 [2026-03-25 21:15:00,615][__main__][INFO] - agents played in iteration 463 are Alice, Bob [2026-03-25 21:15:01,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:01,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:01,403][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:15:01,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:15:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:15:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:15:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:15:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:15:04,672][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:15:05,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:15:05,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:15:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:15:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:15:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:15:08,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 21:15:09,269][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 21:15:09,925][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:15:10,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:15:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:15:11,895][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:15:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:15:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:15:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:15:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:15:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:15:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:15:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:15:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:15:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:15:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:15:19,116][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:15:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:15:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:15:21,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:15:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:15:22,401][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
21:15:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:15:23,714][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:15:24,371][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:15:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:15:25,684][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:15:26,340][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:15:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:15:27,654][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:15:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:15:28,967][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:15:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:15:30,280][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:15:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:15:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:15:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:15:32,907][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:15:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:15:34,537][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:15:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:15:35,850][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 21:15:36,507][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 21:15:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:15:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:15:38,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:15:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:15:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:15:40,449][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:15:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:15:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:15:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:15:43,077][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:15:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:15:44,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:15:45,117][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:15:46,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:15:46,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:15:46,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:15:48,111][__main__][INFO] - Iteration 464 took 55s (14.29% Gen, 82.69% Train). Generation: 7s, Training: 45s. Estimated remaining time: 8h 29m 16s. Estimated total time: 15h 23m 40s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 50s. [2026-03-25 21:15:48,114][__main__][INFO] - Starting iteration 464. [2026-03-25 21:15:48,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:15:48,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:15:53,135][__main__][INFO] - Number of regex retries in iteration 464: 0 [2026-03-25 21:15:53,136][__main__][INFO] - agents played in iteration 464 are Alice, Bob [2026-03-25 21:15:54,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:54,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:54,181][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:15:54,182][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:15:54,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:15:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:15:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:15:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:15:57,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:15:58,140][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:15:58,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:15:59,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:16:00,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:16:00,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:16:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:16:02,079][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:16:02,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:16:03,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:16:04,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:16:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:16:05,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:16:06,019][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:16:06,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:16:07,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:16:07,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:16:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:16:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:16:09,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:16:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:16:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:16:11,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:16:12,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:16:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:16:13,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:16:14,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:16:15,217][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:16:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:16:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:16:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:16:17,841][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:16:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:16:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:16:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:16:20,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:16:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:16:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:16:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:16:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:16:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:16:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:16:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:16:25,720][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:16:26,635][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:16:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:16:27,949][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:16:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:16:29,262][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:16:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:16:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:16:31,233][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:16:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:16:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:16:33,204][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:16:33,860][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:16:34,517][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:16:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:16:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:16:36,487][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:16:37,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:16:37,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:16:39,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:16:39,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:16:39,116][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:16:40,366][__main__][INFO] - Iteration 465 took 52s (9.60% Gen, 88.00% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 35m 34s. Estimated total time: 14h 30m 50s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 25s. [2026-03-25 21:16:40,368][__main__][INFO] - Starting iteration 465. [2026-03-25 21:16:40,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:16:40,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:16:45,064][__main__][INFO] - Number of regex retries in iteration 465: 0 [2026-03-25 21:16:45,065][__main__][INFO] - agents played in iteration 465 are Alice, Bob [2026-03-25 21:16:45,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:16:45,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:16:45,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:16:45,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:16:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:16:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:16:47,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:16:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:16:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:16:49,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:16:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:16:51,092][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:16:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:16:52,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:16:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:16:53,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:16:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:16:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:16:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:16:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:16:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:16:57,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:16:58,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:16:58,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:16:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:17:00,297][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:17:00,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:17:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:17:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:17:02,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:17:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:17:04,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:17:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:17:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:17:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:17:06,868][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:17:07,525][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:17:08,181][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:17:08,838][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:17:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:17:10,151][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:17:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:17:11,464][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:17:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:17:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:17:13,434][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:17:14,090][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:17:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:17:15,403][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:17:16,061][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:17:16,718][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:17:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:17:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:17:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:17:19,699][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:17:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:17:21,012][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:17:21,669][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:17:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:17:22,983][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:17:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:17:24,296][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:17:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:17:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:17:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:17:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:17:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:17:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:17:28,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:17:29,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:17:31,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:17:31,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:17:31,027][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:17:32,378][__main__][INFO] - Iteration 466 took 52s (9.02% Gen, 88.37% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 30m 39s. Estimated total time: 14h 26m 47s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 23s. [2026-03-25 21:17:32,381][__main__][INFO] - Starting iteration 466. [2026-03-25 21:17:32,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:17:32,387][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:17:37,314][__main__][INFO] - Number of regex retries in iteration 466: 0 [2026-03-25 21:17:37,315][__main__][INFO] - agents played in iteration 466 are Alice, Bob [2026-03-25 21:17:38,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:17:38,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:17:38,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:17:38,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:17:38,880][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:17:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:17:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:17:40,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:17:41,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:17:42,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:17:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:17:43,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:17:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:17:44,745][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:17:45,403][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:17:46,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:17:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:17:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:17:48,038][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:17:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:17:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:17:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:17:50,671][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:17:51,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:17:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:17:52,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:17:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:17:53,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:17:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:17:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:17:55,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:17:56,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:17:57,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:17:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:17:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:17:59,212][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:17:59,870][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:18:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:18:01,184][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:18:01,841][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:18:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:18:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:18:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:18:04,469][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:18:05,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:18:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:18:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:18:07,098][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:18:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:18:08,411][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:18:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:18:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:18:10,727][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:18:11,385][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:18:12,042][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:18:12,699][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:18:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:18:14,012][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:18:14,668][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:18:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:18:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:18:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:18:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:18:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:18:18,609][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:18:19,266][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:18:19,923][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:18:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:18:21,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:18:21,951][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:18:23,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:18:23,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:18:23,231][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:18:24,430][__main__][INFO] - Iteration 467 took 52s (9.47% Gen, 88.22% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 30m 26s. Estimated total time: 14h 27m 26s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-25 21:18:24,433][__main__][INFO] - Starting iteration 467. [2026-03-25 21:18:24,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:18:24,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:18:29,236][__main__][INFO] - Number of regex retries in iteration 467: 0 [2026-03-25 21:18:29,237][__main__][INFO] - agents played in iteration 467 are Alice, Bob [2026-03-25 21:18:30,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:18:30,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:18:30,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:18:30,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:18:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:18:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:18:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:18:32,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:18:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:18:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:18:34,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:18:35,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:18:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:18:36,721][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:18:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:18:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:18:38,691][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:18:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:18:40,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:18:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:18:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:18:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:18:42,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:18:43,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:18:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:18:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:18:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:18:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:18:46,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:18:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:18:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:18:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:18:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:18:49,857][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:18:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:18:51,170][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:18:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:18:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:18:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:18:53,798][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:18:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:18:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:18:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:18:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:18:57,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:18:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:18:58,396][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:18:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:18:59,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:19:00,366][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:19:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:19:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:19:02,638][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:19:03,295][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:19:03,952][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:19:04,609][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:19:05,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:19:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:19:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:19:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:19:07,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:19:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:19:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:19:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:19:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:19:11,176][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:19:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:19:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:19:13,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:19:13,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:19:15,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:19:15,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:19:15,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:19:16,668][__main__][INFO] - Iteration 468 took 52s (9.19% Gen, 87.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 32m 39s. Estimated total time: 14h 30m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 16s. [2026-03-25 21:19:16,672][__main__][INFO] - Starting iteration 468. [2026-03-25 21:19:16,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:19:16,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:19:21,627][__main__][INFO] - Number of regex retries in iteration 468: 0 [2026-03-25 21:19:21,629][__main__][INFO] - agents played in iteration 468 are Alice, Bob [2026-03-25 21:19:22,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:19:22,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:19:22,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:19:22,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:19:23,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:19:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:19:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:19:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:19:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:19:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:19:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:19:27,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:19:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:19:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:19:29,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:19:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:19:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:19:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:19:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:19:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:19:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:19:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:19:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:19:35,536][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:19:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:19:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:19:37,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:19:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:19:38,819][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:19:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:19:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:19:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:19:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:19:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:19:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:19:43,415][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:19:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:19:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:19:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:19:46,042][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:19:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:19:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:19:48,013][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:19:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:19:49,326][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:19:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:19:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:19:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:19:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:19:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:19:53,266][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:19:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:19:54,898][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:19:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:19:56,212][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:19:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:19:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:19:58,182][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:19:58,838][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:19:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:20:00,151][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:20:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:20:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:20:02,123][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:20:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:20:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:20:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:20:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:20:05,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:20:06,159][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:20:07,275][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:20:07,278][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:20:07,280][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:20:08,645][__main__][INFO] - Iteration 469 took 51s (9.49% Gen, 87.88% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 27m 1s. Estimated total time: 14h 25m 46s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 53s. [2026-03-25 21:20:08,647][__main__][INFO] - Starting iteration 469. [2026-03-25 21:20:08,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:20:08,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:20:13,690][__main__][INFO] - Number of regex retries in iteration 469: 0 [2026-03-25 21:20:13,692][__main__][INFO] - agents played in iteration 469 are Alice, Bob [2026-03-25 21:20:14,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:20:14,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:20:14,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:20:14,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:20:15,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:20:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:20:16,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:20:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:20:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:20:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:20:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:20:19,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:20:20,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:20:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:20:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:20:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:20:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:20:23,670][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:20:24,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:20:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:20:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:20:26,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:20:26,955][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:20:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:20:28,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:20:28,925][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:20:29,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:20:30,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:20:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:20:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:20:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:20:32,867][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:20:33,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:20:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:20:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:20:35,496][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:20:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:20:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:20:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:20:38,123][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:20:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:20:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:20:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:20:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:20:41,407][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:20:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:20:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:20:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:20:44,035][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:20:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:20:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:20:46,005][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:20:46,971][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:20:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:20:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:20:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:20:49,601][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:20:50,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:20:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:20:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:20:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:20:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:20:53,543][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:20:54,199][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:20:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:20:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:20:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:20:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:20:57,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:20:58,208][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:20:59,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:20:59,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:20:59,314][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:21:00,720][__main__][INFO] - Iteration 470 took 52s (9.68% Gen, 87.62% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 28m 14s. Estimated total time: 14h 27m 50s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 55s. [2026-03-25 21:21:00,723][__main__][INFO] - Starting iteration 470. [2026-03-25 21:21:00,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:21:00,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:21:05,572][__main__][INFO] - Number of regex retries in iteration 470: 0 [2026-03-25 21:21:05,574][__main__][INFO] - agents played in iteration 470 are Alice, Bob [2026-03-25 21:21:06,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:06,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:06,142][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:21:06,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:21:06,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:21:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:21:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:21:08,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:21:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:21:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:21:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:21:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:21:12,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:21:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:21:13,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:21:14,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:21:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:21:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:21:15,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:21:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:21:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:21:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:21:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:21:19,256][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:21:19,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:21:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:21:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:21:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:21:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:21:23,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:21:23,853][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:21:24,510][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:21:25,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:21:25,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:21:26,480][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:21:27,136][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:21:27,794][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:21:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:21:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:21:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:21:30,420][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:21:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:21:31,734][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:21:32,391][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:21:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:21:33,704][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:21:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:21:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:21:35,674][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:21:36,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:21:36,988][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:21:37,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:21:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:21:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:21:39,862][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:21:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:21:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:21:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:21:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:21:43,145][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:21:43,802][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:21:44,459][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:21:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:21:45,773][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:21:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:21:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:21:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:21:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:21:49,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:21:49,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 21:21:50,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:21:50,874][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:21:50,875][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:21:52,169][__main__][INFO] - Iteration 471 took 51s (9.42% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 16m 55s. Estimated total time: 14h 17m 23s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 44s, 500 more iterations: 7h 8m 41s. [2026-03-25 21:21:52,172][__main__][INFO] - Starting iteration 471. [2026-03-25 21:21:52,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:21:52,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:21:56,903][__main__][INFO] - Number of regex retries in iteration 471: 0 [2026-03-25 21:21:56,905][__main__][INFO] - agents played in iteration 471 are Alice, Bob [2026-03-25 21:21:57,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:57,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:57,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:21:57,587][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:21:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:21:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:21:59,556][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:22:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:22:00,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:22:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:22:02,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:22:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:22:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:22:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:22:04,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:22:05,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:22:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:22:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:22:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:22:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:22:08,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:22:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:22:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:22:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:22:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:22:12,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:22:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:22:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:22:14,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:22:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:22:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:22:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:22:16,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:22:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:22:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:22:18,598][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:22:19,255][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:22:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:22:20,568][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:22:21,225][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:22:21,881][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:22:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:22:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:22:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:22:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:22:25,165][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:22:25,822][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:22:26,479][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:22:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:22:27,793][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:22:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:22:29,106][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:22:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:22:30,713][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:22:31,369][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:22:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:22:32,683][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:22:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:22:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:22:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:22:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:22:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:22:36,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:22:37,280][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:22:37,937][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:22:38,593][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:22:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:22:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:22:40,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:22:41,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 21:22:42,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:22:42,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:22:42,422][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:22:43,850][__main__][INFO] - Iteration 472 took 51s (9.15% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 19m 57s. Estimated total time: 14h 21m 17s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 7s, 500 more iterations: 7h 10m 38s. [2026-03-25 21:22:43,853][__main__][INFO] - Starting iteration 472. [2026-03-25 21:22:43,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:22:43,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:22:48,690][__main__][INFO] - Number of regex retries in iteration 472: 0 [2026-03-25 21:22:48,691][__main__][INFO] - agents played in iteration 472 are Alice, Bob [2026-03-25 21:22:49,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:22:49,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:22:49,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:22:49,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:22:49,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:22:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:22:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:22:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:22:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:22:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:22:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:22:54,528][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:22:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:22:55,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:22:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:22:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:22:57,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:22:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:22:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:22:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:23:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:23:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:23:01,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:23:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:23:03,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:23:03,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:23:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:23:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:23:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:23:06,352][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:23:07,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:23:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:23:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:23:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:23:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:23:10,294][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:23:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:23:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:23:12,265][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:23:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:23:13,580][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:23:14,237][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:23:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:23:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:23:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:23:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:23:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:23:18,179][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:23:18,837][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:23:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:23:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:23:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:23:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:23:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:23:23,021][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:23:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:23:24,335][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:23:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:23:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:23:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:23:26,962][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:23:27,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:23:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:23:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:23:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:23:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:23:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:23:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:23:32,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:23:32,949][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:23:34,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:23:34,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:23:34,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:23:35,339][__main__][INFO] - Iteration 473 took 51s (9.39% Gen, 88.27% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 15m 53s. Estimated total time: 14h 18m 4s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 48s, 500 more iterations: 7h 9m 2s. [2026-03-25 21:23:35,342][__main__][INFO] - Starting iteration 473. [2026-03-25 21:23:35,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:23:35,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:23:40,191][__main__][INFO] - Number of regex retries in iteration 473: 0 [2026-03-25 21:23:40,193][__main__][INFO] - agents played in iteration 473 are Alice, Bob [2026-03-25 21:23:40,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:23:40,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:23:40,860][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:23:40,860][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:23:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:23:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:23:42,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:23:43,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:23:44,128][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:23:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:23:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:23:46,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:23:46,756][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:23:47,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:23:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:23:48,727][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:23:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:23:50,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:23:50,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:23:51,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:23:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:23:52,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:23:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:23:53,979][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:23:54,636][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:23:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:23:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:23:56,607][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:23:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:23:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:23:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:23:59,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:23:59,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:24:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:24:01,203][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:24:01,860][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:24:02,517][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:24:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:24:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:24:04,487][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:24:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:24:05,801][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:24:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:24:07,114][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:24:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:24:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:24:09,084][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:24:09,741][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:24:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:24:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:24:11,710][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:24:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:24:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:24:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:24:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:24:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:24:15,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:24:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:24:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:24:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:24:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:24:19,200][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:24:19,857][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:24:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:24:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:24:21,828][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:24:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:24:23,142][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:24:23,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:24:24,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:24:25,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:24:25,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:24:25,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:24:27,549][__main__][INFO] - Iteration 474 took 52s (9.28% Gen, 87.24% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 27m 1s. Estimated total time: 14h 30m 4s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 2s. [2026-03-25 21:24:27,551][__main__][INFO] - Starting iteration 474. [2026-03-25 21:24:27,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:24:27,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:24:35,823][__main__][INFO] - Number of regex retries in iteration 474: 0 [2026-03-25 21:24:35,825][__main__][INFO] - agents played in iteration 474 are Alice, Bob [2026-03-25 21:24:36,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:36,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:36,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:24:36,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:24:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:24:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:24:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:24:39,042][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:24:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:24:40,353][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:24:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:24:41,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:24:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:24:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:24:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:24:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:24:44,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:24:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:24:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:24:46,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:24:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:24:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:24:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:24:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:24:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:24:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:24:51,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:24:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:24:52,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:24:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:24:54,141][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:24:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:24:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:24:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:24:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:24:57,425][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:24:58,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:24:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:24:59,395][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:25:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:25:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:25:01,365][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:25:02,022][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:25:02,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:25:03,335][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:25:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:25:04,648][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:25:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:25:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:25:06,617][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:25:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:25:07,930][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:25:08,832][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:25:09,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:25:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:25:10,803][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:25:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:25:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:25:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:25:13,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:25:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:25:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:25:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:25:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:25:16,716][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:25:17,373][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:25:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:25:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:25:19,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:25:20,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 21:25:21,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:25:21,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:25:21,177][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:25:22,664][__main__][INFO] - Iteration 475 took 55s (14.95% Gen, 82.29% Train). Generation: 8s, Training: 45s. Estimated remaining time: 8h 14m 32s. Estimated total time: 15h 18m 31s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 15s. [2026-03-25 21:25:22,666][__main__][INFO] - Starting iteration 475. [2026-03-25 21:25:22,670][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:25:22,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:25:27,480][__main__][INFO] - Number of regex retries in iteration 475: 0 [2026-03-25 21:25:27,482][__main__][INFO] - agents played in iteration 475 are Alice, Bob [2026-03-25 21:25:28,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:25:28,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:25:28,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:25:28,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:25:28,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:25:29,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:25:30,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:25:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:25:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:25:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:25:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:25:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:25:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:25:34,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:25:35,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:25:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:25:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:25:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:25:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:25:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:25:39,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:25:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:25:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:25:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:25:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:25:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:25:43,251][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:25:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:25:44,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:25:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:25:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:25:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:25:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:25:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:25:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:25:49,165][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:25:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:25:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:25:51,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:25:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:25:52,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:25:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:25:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:25:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:25:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:25:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:25:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:25:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:25:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:25:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:25:59,020][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:25:59,677][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:26:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:26:01,258][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:26:01,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:26:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:26:03,228][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:26:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:26:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:26:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:26:05,855][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:26:06,512][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:26:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:26:07,825][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:26:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:26:09,139][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:26:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:26:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:26:11,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:26:11,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 21:26:12,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:26:12,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:26:12,944][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:26:14,125][__main__][INFO] - Iteration 476 took 51s (9.35% Gen, 88.35% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 12m 46s. Estimated total time: 14h 17m 36s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 45s, 500 more iterations: 7h 8m 48s. [2026-03-25 21:26:14,128][__main__][INFO] - Starting iteration 476. [2026-03-25 21:26:14,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:26:14,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:26:19,217][__main__][INFO] - Number of regex retries in iteration 476: 0 [2026-03-25 21:26:19,218][__main__][INFO] - agents played in iteration 476 are Alice, Bob [2026-03-25 21:26:19,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:26:19,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:26:19,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:26:19,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:26:20,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:26:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:26:21,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:26:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:26:23,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:26:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:26:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:26:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:26:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:26:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:26:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:26:27,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:26:28,405][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:26:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:26:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:26:30,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:26:31,033][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:26:31,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:26:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:26:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:26:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:26:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:26:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:26:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:26:36,290][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:26:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:26:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:26:38,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:26:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:26:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:26:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:26:40,886][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:26:41,542][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:26:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:26:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:26:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:26:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:26:44,826][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:26:45,482][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:26:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:26:46,795][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:26:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:26:48,109][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:26:48,765][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:26:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:26:50,079][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:26:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:26:51,392][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:26:52,300][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:26:52,957][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:26:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:26:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:26:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:26:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:26:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:26:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:26:57,556][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:26:58,213][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:26:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:26:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:27:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:27:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:27:01,497][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:27:02,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:27:02,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:27:03,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:27:04,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:27:04,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:27:04,741][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:27:05,915][__main__][INFO] - Iteration 477 took 51s (9.82% Gen, 87.91% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 17m 23s. Estimated total time: 14h 23m 5s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 32s. [2026-03-25 21:27:05,918][__main__][INFO] - Starting iteration 477. [2026-03-25 21:27:05,922][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:27:05,922][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:27:10,768][__main__][INFO] - Number of regex retries in iteration 477: 0 [2026-03-25 21:27:10,770][__main__][INFO] - agents played in iteration 477 are Alice, Bob [2026-03-25 21:27:11,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:27:11,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:27:11,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:27:11,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:27:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:27:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:27:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:27:13,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:27:14,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:27:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:27:15,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:27:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:27:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:27:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:27:18,558][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:27:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:27:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:27:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:27:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:27:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:27:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:27:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:27:23,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:27:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:27:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:27:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:27:26,436][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:27:27,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:27:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:27:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:27:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:27:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:27:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:27:31,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:27:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:27:32,344][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:27:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:27:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:27:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:27:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:27:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:27:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:27:36,940][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:27:37,596][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:27:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:27:38,909][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:27:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:27:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:27:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:27:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:27:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:27:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:27:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:27:44,440][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:27:45,098][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:27:45,756][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:27:46,415][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:27:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:27:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:27:48,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:27:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:27:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:27:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:27:51,019][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:27:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:27:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:27:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:27:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:27:54,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:27:55,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:27:56,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:27:56,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:27:56,306][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:27:57,661][__main__][INFO] - Iteration 478 took 51s (9.37% Gen, 88.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 15m 47s. Estimated total time: 14h 22m 20s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 10s. [2026-03-25 21:27:57,664][__main__][INFO] - Starting iteration 478. [2026-03-25 21:27:57,668][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:27:57,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:28:02,741][__main__][INFO] - Number of regex retries in iteration 478: 0 [2026-03-25 21:28:02,743][__main__][INFO] - agents played in iteration 478 are Alice, Bob [2026-03-25 21:28:03,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:03,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:03,419][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:28:03,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:28:04,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:28:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:28:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:28:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:28:06,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:28:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:28:08,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:28:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:28:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:28:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:28:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:28:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:28:11,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:28:12,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:28:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:28:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:28:14,578][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:28:15,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:28:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:28:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:28:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:28:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:28:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:28:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:28:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:28:20,492][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:28:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:28:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:28:22,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:28:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:28:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:28:24,440][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:28:25,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:28:25,754][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:28:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:28:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:28:27,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:28:28,384][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:28:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:28:29,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:28:30,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:28:31,015][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:28:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:28:32,330][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:28:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:28:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:28:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:28:34,964][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:28:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:28:36,613][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:28:37,270][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:28:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:28:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:28:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:28:39,898][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:28:40,555][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:28:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:28:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:28:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:28:43,183][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:28:43,840][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:28:44,496][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:28:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:28:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:28:46,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:28:47,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:28:48,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:28:48,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:28:48,433][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:28:49,733][__main__][INFO] - Iteration 479 took 52s (9.75% Gen, 87.75% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 20m 20s. Estimated total time: 14h 27m 46s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 53s. [2026-03-25 21:28:49,736][__main__][INFO] - Starting iteration 479. [2026-03-25 21:28:49,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:28:49,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:28:54,665][__main__][INFO] - Number of regex retries in iteration 479: 0 [2026-03-25 21:28:54,667][__main__][INFO] - agents played in iteration 479 are Alice, Bob [2026-03-25 21:28:55,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:55,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:55,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:28:55,242][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:28:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:28:56,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:28:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:28:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:28:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:28:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:28:59,829][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:29:00,487][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:29:01,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:29:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:29:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:29:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:29:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:29:04,430][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:29:05,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:29:05,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:29:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:29:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:29:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:29:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:29:09,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:29:09,684][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:29:10,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:29:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:29:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:29:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:29:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:29:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:29:14,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:29:14,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:29:15,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:29:16,252][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:29:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:29:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:29:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:29:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:29:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:29:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:29:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:29:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:29:22,170][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:29:22,827][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:29:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:29:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:29:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:29:25,457][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:29:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:29:26,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:29:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:29:28,383][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:29:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:29:29,699][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:29:30,357][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:29:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:29:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:29:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:29:32,986][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:29:33,642][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:29:34,300][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:29:34,957][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:29:35,615][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:29:36,271][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:29:36,929][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:29:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:29:38,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:29:39,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:29:40,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:29:40,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:29:40,130][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:29:41,416][__main__][INFO] - Iteration 480 took 51s (9.53% Gen, 87.97% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 13m 1s. Estimated total time: 14h 21m 18s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 7s, 500 more iterations: 7h 10m 39s. [2026-03-25 21:29:41,419][__main__][INFO] - Starting iteration 480. [2026-03-25 21:29:41,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:29:41,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:29:46,243][__main__][INFO] - Number of regex retries in iteration 480: 0 [2026-03-25 21:29:46,244][__main__][INFO] - agents played in iteration 480 are Alice, Bob [2026-03-25 21:29:46,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:29:46,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:29:46,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:29:46,925][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:29:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:29:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:29:48,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:29:49,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:29:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:29:50,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:29:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:29:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:29:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:29:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:29:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:29:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:29:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:29:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:29:56,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:29:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:29:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:29:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:29:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:30:00,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:30:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:30:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:30:02,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:30:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:30:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:30:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:30:04,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:30:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:30:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:30:06,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:30:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:30:07,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:30:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:30:09,276][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:30:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:30:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:30:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:30:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:30:12,563][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:30:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:30:13,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:30:14,533][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:30:15,190][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:30:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:30:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:30:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:30:17,818][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:30:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:30:19,449][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:30:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:30:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:30:21,422][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:30:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:30:22,739][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:30:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:30:24,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:30:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:30:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:30:26,028][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:30:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:30:27,344][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:30:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:30:28,660][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:30:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:30:29,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:30:30,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:30:31,815][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:30:31,818][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:30:31,819][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:30:33,064][__main__][INFO] - Iteration 481 took 51s (9.33% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 11m 33s. Estimated total time: 14h 20m 42s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 4s, 500 more iterations: 7h 10m 21s. [2026-03-25 21:30:33,067][__main__][INFO] - Starting iteration 481. [2026-03-25 21:30:33,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:30:33,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:30:37,806][__main__][INFO] - Number of regex retries in iteration 481: 0 [2026-03-25 21:30:37,808][__main__][INFO] - agents played in iteration 481 are Alice, Bob [2026-03-25 21:30:38,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:30:38,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:30:38,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:30:38,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:30:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:30:39,797][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:30:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:30:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:30:41,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:30:42,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:30:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:30:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:30:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:30:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:30:45,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:30:46,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:30:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:30:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:30:48,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:30:49,008][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:30:49,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:30:50,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:30:50,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:30:51,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:30:52,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:30:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:30:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:30:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:30:54,925][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:30:55,582][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:30:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:30:56,896][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:30:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:30:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:30:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:30:59,526][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:31:00,184][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:31:00,841][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:31:01,499][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:31:02,157][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:31:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:31:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:31:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:31:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:31:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:31:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:31:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:31:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:31:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:31:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:31:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:31:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:31:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:31:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:31:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:31:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:31:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:31:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:31:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:31:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:31:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:31:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:31:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:31:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:31:18,833][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:31:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:31:20,148][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:31:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:31:21,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:31:22,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:31:23,276][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:31:23,278][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:31:23,280][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:31:24,700][__main__][INFO] - Iteration 482 took 51s (9.17% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 10m 30s. Estimated total time: 14h 20m 31s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 3s, 500 more iterations: 7h 10m 15s. [2026-03-25 21:31:24,704][__main__][INFO] - Starting iteration 482. [2026-03-25 21:31:24,710][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:31:24,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:31:29,552][__main__][INFO] - Number of regex retries in iteration 482: 0 [2026-03-25 21:31:29,554][__main__][INFO] - agents played in iteration 482 are Alice, Bob [2026-03-25 21:31:30,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:31:30,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:31:30,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:31:30,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:31:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:31:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:31:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:31:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:31:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:31:34,128][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:31:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:31:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:31:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:31:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:31:37,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:31:38,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:31:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:31:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:31:40,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:31:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:31:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:31:42,016][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:31:42,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:31:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:31:43,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:31:44,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:31:45,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:31:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:31:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:31:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:31:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:31:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:31:49,252][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:31:49,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:31:50,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:31:51,224][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:31:51,881][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:31:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:31:53,195][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:31:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:31:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:31:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:31:55,823][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:31:56,481][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:31:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:31:57,795][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:31:58,453][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:31:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:31:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:32:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:32:01,083][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:32:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:32:02,720][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:32:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:32:04,036][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:32:04,693][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:32:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:32:06,008][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:32:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:32:07,323][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:32:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:32:08,638][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:32:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:32:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:32:10,609][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:32:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:32:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:32:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:32:13,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:32:13,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:32:14,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:32:14,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:32:14,964][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:32:16,332][__main__][INFO] - Iteration 483 took 51s (9.38% Gen, 87.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 9m 32s. Estimated total time: 14h 20m 24s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 2s, 500 more iterations: 7h 10m 12s. [2026-03-25 21:32:16,335][__main__][INFO] - Starting iteration 483. [2026-03-25 21:32:16,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:32:16,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:32:21,266][__main__][INFO] - Number of regex retries in iteration 483: 0 [2026-03-25 21:32:21,268][__main__][INFO] - agents played in iteration 483 are Alice, Bob [2026-03-25 21:32:21,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:32:21,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:32:21,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:32:21,951][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:32:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:32:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:32:23,916][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:32:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:32:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:32:25,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:32:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:32:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:32:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:32:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:32:29,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:32:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:32:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:32:31,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:32:31,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:32:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:32:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:32:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:32:34,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:32:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:32:35,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:32:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:32:37,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:32:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:32:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:32:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:32:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:32:40,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:32:41,012][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:32:41,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:32:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:32:42,986][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:32:43,643][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:32:44,301][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:32:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:32:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:32:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:32:46,929][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:32:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:32:48,243][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:32:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:32:49,556][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:32:50,214][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:32:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:32:51,529][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:32:52,187][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:32:52,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:32:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:32:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:32:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:32:55,717][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:32:56,380][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:32:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:32:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:32:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:32:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:32:59,683][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:33:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:33:00,999][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:33:01,658][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:33:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:33:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:33:03,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:33:04,291][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:33:04,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:33:05,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:33:06,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:33:06,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:33:06,878][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:33:08,256][__main__][INFO] - Iteration 484 took 51s (9.49% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 13m 33s. Estimated total time: 14h 25m 18s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 39s. [2026-03-25 21:33:08,259][__main__][INFO] - Starting iteration 484. [2026-03-25 21:33:08,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:33:08,264][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:33:13,317][__main__][INFO] - Number of regex retries in iteration 484: 0 [2026-03-25 21:33:13,318][__main__][INFO] - agents played in iteration 484 are Alice, Bob [2026-03-25 21:33:13,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:33:13,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:33:13,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:33:13,904][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:33:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:33:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:33:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:33:16,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:33:17,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:33:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:33:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:33:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:33:19,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:33:20,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:33:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:33:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:33:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:33:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:33:23,810][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:33:24,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:33:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:33:25,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:33:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:33:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:33:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:33:28,412][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:33:29,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:33:29,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:33:30,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:33:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:33:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:33:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:33:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:33:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:33:34,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:33:34,990][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:33:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:33:36,310][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:33:36,973][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:33:37,630][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:33:38,288][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:33:38,946][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:33:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:33:40,263][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:33:40,921][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:33:41,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:33:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:33:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:33:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:33:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:33:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:33:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:33:46,521][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:33:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:33:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:33:48,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:33:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:33:49,811][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:33:50,469][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:33:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:33:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:33:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:33:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:33:53,784][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:33:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:33:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:33:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:33:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:33:57,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:33:57,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:33:58,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:33:58,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:33:58,941][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:34:00,285][__main__][INFO] - Iteration 485 took 52s (9.72% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 14m 27s. Estimated total time: 14h 27m 3s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 31s. [2026-03-25 21:34:00,288][__main__][INFO] - Starting iteration 485. [2026-03-25 21:34:00,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:34:00,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:34:05,136][__main__][INFO] - Number of regex retries in iteration 485: 0 [2026-03-25 21:34:05,137][__main__][INFO] - agents played in iteration 485 are Alice, Bob [2026-03-25 21:34:05,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:05,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:05,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:34:05,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:34:06,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:34:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:34:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:34:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:34:09,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:34:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:34:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:34:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:34:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:34:12,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:34:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:34:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:34:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:34:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:34:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:34:16,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:34:16,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:34:17,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:34:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:34:18,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:34:19,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:34:20,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:34:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:34:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:34:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:34:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:34:23,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:34:24,215][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:34:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:34:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:34:26,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:34:26,844][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:34:27,501][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:34:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:34:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:34:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:34:30,131][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:34:30,789][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:34:31,447][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:34:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:34:32,763][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:34:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:34:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:34:34,737][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:34:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:34:36,053][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:34:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:34:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:34:38,355][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:34:39,013][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:34:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:34:40,328][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:34:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:34:41,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:34:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:34:42,959][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:34:43,615][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:34:44,272][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:34:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:34:45,586][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:34:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:34:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:34:47,557][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:34:48,215][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:34:48,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:34:49,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:34:50,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:34:50,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:34:50,710][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:34:51,939][__main__][INFO] - Iteration 486 took 51s (9.39% Gen, 88.23% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 7m 21s. Estimated total time: 14h 20m 48s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 4s, 500 more iterations: 7h 10m 24s. [2026-03-25 21:34:51,941][__main__][INFO] - Starting iteration 486. [2026-03-25 21:34:51,945][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:34:51,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:34:56,737][__main__][INFO] - Number of regex retries in iteration 486: 0 [2026-03-25 21:34:56,739][__main__][INFO] - agents played in iteration 486 are Alice, Bob [2026-03-25 21:34:57,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:57,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:57,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:34:57,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:34:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:34:58,715][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:34:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:35:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:35:00,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:35:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:35:02,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:35:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:35:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:35:03,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:35:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:35:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:35:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:35:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:35:07,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:35:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:35:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:35:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:35:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:35:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:35:11,204][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:35:11,861][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:35:12,519][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:35:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:35:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:35:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:35:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:35:15,808][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:35:16,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:35:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:35:17,782][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:35:18,440][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:35:19,098][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:35:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:35:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:35:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:35:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:35:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:35:23,042][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:35:23,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:35:24,356][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:35:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:35:25,671][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:35:26,329][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:35:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:35:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:35:28,303][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:35:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:35:29,925][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:35:30,583][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:35:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:35:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:35:32,556][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:35:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:35:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:35:34,530][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:35:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:35:35,844][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:35:36,502][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:35:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:35:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:35:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:35:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:35:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:35:40,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:35:41,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:35:42,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:35:42,254][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:35:42,256][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:35:43,621][__main__][INFO] - Iteration 487 took 51s (9.27% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 6m 57s. Estimated total time: 14h 21m 16s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 7s, 500 more iterations: 7h 10m 38s. [2026-03-25 21:35:43,624][__main__][INFO] - Starting iteration 487. [2026-03-25 21:35:43,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2026-03-25 21:35:43,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:35:45,709][mllm.models.large_language_model_local][WARNING] - Response user The scores are updated. The next round begins. 
did not match regex: (|), retry 1/1 [2026-03-25 21:35:48,864][__main__][INFO] - Number of regex retries in iteration 487: 1 [2026-03-25 21:35:48,866][__main__][INFO] - agents played in iteration 487 are Alice, Bob [2026-03-25 21:35:49,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:35:49,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:35:49,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:35:49,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:35:50,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:35:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:35:51,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:35:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:35:52,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:35:53,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:35:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:35:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:35:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:35:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:35:56,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 21:35:57,425][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 21:35:58,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:35:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:35:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:36:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:36:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:36:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:36:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:36:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:36:03,341][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:36:03,999][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:36:04,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:36:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:36:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:36:06,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:36:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:36:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:36:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:36:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:36:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:36:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
21:36:11,241][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:36:11,899][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:36:12,556][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:36:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:36:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:36:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:36:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:36:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:36:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:36:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:36:17,819][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:36:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:36:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:36:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:36:20,446][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:36:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:36:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:36:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:36:23,317][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:36:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 21:36:24,632][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 21:36:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:36:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:36:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:36:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:36:27,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:36:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:36:29,236][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:36:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:36:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:36:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:36:31,867][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:36:32,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:36:33,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:36:34,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:36:34,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:36:34,246][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:36:35,481][__main__][INFO] - Iteration 488 took 51s (10.10% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 9m 2s. Estimated total time: 14h 24m 14s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 7s. [2026-03-25 21:36:35,483][__main__][INFO] - Starting iteration 488. [2026-03-25 21:36:35,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:36:35,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:36:41,248][__main__][INFO] - Number of regex retries in iteration 488: 0 [2026-03-25 21:36:41,250][__main__][INFO] - agents played in iteration 488 are Alice, Bob [2026-03-25 21:36:41,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:36:41,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:36:41,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:36:41,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:36:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:36:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:36:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:36:44,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:36:45,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:36:45,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:36:46,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:36:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:36:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:36:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:36:49,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:36:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:36:50,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:36:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:36:51,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:36:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:36:53,086][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:36:53,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:36:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:36:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:36:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:36:56,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:36:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:36:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:36:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:36:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:36:59,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:37:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:37:01,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:37:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:37:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:37:03,014][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:37:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:37:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:37:04,995][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:37:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:37:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:37:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:37:07,626][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:37:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:37:08,942][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:37:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:37:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:37:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:37:11,575][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:37:12,233][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:37:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:37:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:37:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:37:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:37:15,883][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:37:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:37:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:37:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:37:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:37:19,174][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:37:19,831][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:37:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:37:21,146][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:37:21,803][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:37:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:37:23,117][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:37:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:37:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:37:25,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:37:25,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:37:26,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:37:26,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:37:26,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:37:28,245][__main__][INFO] - Iteration 489 took 52s (10.92% Gen, 86.62% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 23m 14s. Estimated total time: 14h 39m 19s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 39s. [2026-03-25 21:37:28,248][__main__][INFO] - Starting iteration 489. [2026-03-25 21:37:28,252][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:37:28,253][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:37:33,083][__main__][INFO] - Number of regex retries in iteration 489: 0 [2026-03-25 21:37:33,085][__main__][INFO] - agents played in iteration 489 are Alice, Bob [2026-03-25 21:37:33,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:37:33,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:37:33,681][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:37:33,681][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:37:34,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:37:34,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:37:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:37:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:37:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:37:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:37:38,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:37:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:37:39,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:37:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:37:40,898][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:37:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:37:42,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:37:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:37:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:37:44,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:37:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:37:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:37:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:37:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:37:47,494][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:37:48,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:37:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:37:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:37:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:37:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:37:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:37:52,123][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:37:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:37:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:37:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:37:54,760][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:37:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:37:56,082][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:37:56,740][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:37:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:37:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:37:58,717][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:37:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:38:00,034][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:38:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:38:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:38:02,009][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:38:02,666][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:38:03,324][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:38:03,981][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:38:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:38:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:38:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:38:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:38:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:38:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:38:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:38:09,588][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:38:10,245][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:38:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:38:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:38:12,217][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:38:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:38:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:38:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:38:14,851][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:38:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:38:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:38:16,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:38:17,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:38:18,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:38:18,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:38:18,772][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:38:20,047][__main__][INFO] - Iteration 490 took 51s (9.33% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 6m 20s. Estimated total time: 14h 23m 16s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 19s, 500 more iterations: 7h 11m 38s. [2026-03-25 21:38:20,050][__main__][INFO] - Starting iteration 490. [2026-03-25 21:38:20,054][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:38:20,054][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:38:24,900][__main__][INFO] - Number of regex retries in iteration 490: 0 [2026-03-25 21:38:24,901][__main__][INFO] - agents played in iteration 490 are Alice, Bob [2026-03-25 21:38:25,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:38:25,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:38:25,598][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:38:25,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:38:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:38:26,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:38:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:38:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:38:28,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:38:29,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:38:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:38:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:38:31,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:38:32,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:38:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:38:33,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:38:34,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:38:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:38:35,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:38:36,102][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:38:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:38:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:38:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:38:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:38:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:38:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:38:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:38:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:38:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:38:42,677][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:38:43,335][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:38:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:38:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:38:45,308][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:38:45,965][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:38:46,622][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:38:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:38:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:38:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:38:49,249][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:38:49,906][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:38:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:38:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:38:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:38:52,536][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:38:53,194][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:38:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:38:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:38:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:38:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:38:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:38:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:38:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:38:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:38:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:39:00,040][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:39:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:39:01,356][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:39:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:39:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:39:03,331][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:39:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:39:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:39:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:39:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:39:06,617][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:39:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:39:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:39:08,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:39:09,364][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:39:10,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:39:10,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:39:10,577][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:39:11,792][__main__][INFO] - Iteration 491 took 51s (9.37% Gen, 88.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 4m 32s. Estimated total time: 14h 22m 20s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 10s. [2026-03-25 21:39:11,795][__main__][INFO] - Starting iteration 491. [2026-03-25 21:39:11,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:39:11,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:39:16,612][__main__][INFO] - Number of regex retries in iteration 491: 0 [2026-03-25 21:39:16,614][__main__][INFO] - agents played in iteration 491 are Alice, Bob [2026-03-25 21:39:17,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:39:17,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:39:17,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:39:17,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:39:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:39:18,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:39:19,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:39:20,214][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:39:20,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:39:21,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:39:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:39:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:39:23,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:39:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:39:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:39:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:39:26,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:39:26,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:39:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:39:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:39:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:39:29,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:39:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:39:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:39:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:39:32,051][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:39:32,709][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:39:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:39:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:39:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:39:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:39:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:39:36,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:39:37,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:39:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:39:38,626][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:39:39,283][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:39:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:39:40,599][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:39:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:39:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:39:42,571][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:39:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:39:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:39:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:39:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:39:45,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:39:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:39:47,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:39:47,871][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:39:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:39:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:39:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:39:50,816][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:39:51,476][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:39:52,136][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:39:52,796][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:39:53,457][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:39:54,118][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:39:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:39:55,445][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:39:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:39:56,766][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:39:57,429][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:39:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:39:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:39:59,412][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:40:00,071][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:40:00,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:40:01,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:40:02,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:40:02,471][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:40:02,472][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:40:03,892][__main__][INFO] - Iteration 492 took 52s (9.24% Gen, 88.03% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 9m 35s. Estimated total time: 14h 28m 15s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 7s. [2026-03-25 21:40:04,584][__main__][INFO] - Starting iteration 492. [2026-03-25 21:40:04,588][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:40:04,589][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:40:09,498][__main__][INFO] - Number of regex retries in iteration 492: 0 [2026-03-25 21:40:09,500][__main__][INFO] - agents played in iteration 492 are Alice, Bob [2026-03-25 21:40:10,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:40:10,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:40:10,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:40:10,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:40:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:40:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:40:12,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:40:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:40:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:40:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:40:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:40:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:40:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:40:16,813][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:40:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:40:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:40:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:40:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:40:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:40:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:40:21,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:40:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:40:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:40:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:40:24,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:40:24,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:40:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:40:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:40:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:40:27,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:40:28,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:40:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:40:29,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:40:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:40:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:40:31,295][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:40:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:40:32,609][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:40:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:40:33,923][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:40:34,580][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:40:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:40:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:40:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:40:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:40:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:40:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:40:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:40:39,845][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:40:40,503][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:40:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:40:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:40:42,817][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:40:43,475][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:40:44,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:40:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:40:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:40:46,106][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:40:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:40:47,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:40:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:40:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:40:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:40:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:40:50,708][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:40:51,364][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:40:52,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:40:52,680][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:40:53,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:40:54,101][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:40:55,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:40:55,430][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:40:55,431][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:40:56,631][__main__][INFO] - Iteration 493 took 52s (9.44% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 7m 51s. Estimated total time: 14h 27m 23s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 41s. [2026-03-25 21:40:56,634][__main__][INFO] - Starting iteration 493. [2026-03-25 21:40:56,638][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:40:56,638][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:41:01,397][__main__][INFO] - Number of regex retries in iteration 493: 0 [2026-03-25 21:41:01,398][__main__][INFO] - agents played in iteration 493 are Alice, Bob [2026-03-25 21:41:02,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:02,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:02,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:41:02,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:41:02,781][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:41:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:41:04,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:41:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:41:05,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:41:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:41:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:41:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:41:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:41:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:41:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:41:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:41:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:41:11,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:41:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:41:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:41:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:41:13,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:41:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:41:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:41:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:41:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:41:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:41:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:41:18,506][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:41:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:41:19,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:41:20,477][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:41:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:41:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:41:22,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:41:23,103][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:41:23,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:41:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:41:25,074][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:41:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:41:26,389][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:41:27,047][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:41:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:41:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:41:29,020][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:41:29,677][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:41:30,335][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:41:30,993][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:41:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:41:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:41:32,968][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:41:33,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:41:34,589][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:41:35,247][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:41:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:41:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:41:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:41:37,880][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:41:38,539][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:41:39,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:41:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:41:40,514][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:41:41,172][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:41:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:41:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:41:43,154][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:41:43,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:41:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:41:45,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:41:45,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:41:47,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:41:47,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:41:47,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:41:48,377][__main__][INFO] - Iteration 494 took 51s (9.20% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 1m 57s. Estimated total time: 14h 22m 21s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 10s. [2026-03-25 21:41:48,380][__main__][INFO] - Starting iteration 494. [2026-03-25 21:41:48,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:41:48,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:41:53,531][__main__][INFO] - Number of regex retries in iteration 494: 0 [2026-03-25 21:41:53,534][__main__][INFO] - agents played in iteration 494 are Alice, Bob [2026-03-25 21:41:54,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:54,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:54,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:41:54,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:41:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:41:55,765][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:41:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:41:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:41:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:41:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:41:59,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:41:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:42:00,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:42:01,040][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:42:01,698][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:42:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:42:03,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:42:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:42:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:42:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:42:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:42:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:42:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:42:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:42:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:42:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:42:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:42:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:42:10,918][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:42:11,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:42:12,232][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:42:12,889][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:42:13,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:42:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:42:14,862][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:42:15,520][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:42:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:42:16,837][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:42:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:42:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:42:18,810][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:42:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:42:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:42:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:42:21,441][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:42:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:42:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:42:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:42:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:42:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:42:25,387][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:42:26,045][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:42:27,041][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:42:27,699][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:42:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:42:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:42:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:42:30,331][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:42:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:42:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:42:32,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:42:32,960][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:42:33,617][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:42:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:42:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:42:35,590][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:42:36,248][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:42:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:42:37,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:42:38,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:42:39,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:42:39,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:42:39,401][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:42:41,784][__main__][INFO] - Iteration 495 took 53s (9.64% Gen, 85.89% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 28m 43s. Estimated total time: 14h 50m 1s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 0s, 500 more iterations: 7h 25m 0s. [2026-03-25 21:42:41,787][__main__][INFO] - Starting iteration 495. [2026-03-25 21:42:41,791][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:42:41,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:42:46,770][__main__][INFO] - Number of regex retries in iteration 495: 0 [2026-03-25 21:42:46,771][__main__][INFO] - agents played in iteration 495 are Alice, Bob [2026-03-25 21:42:47,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:47,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:47,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:42:47,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:42:48,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:42:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:42:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:42:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:42:50,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:42:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:42:52,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:42:52,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:42:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:42:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:42:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:42:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:42:56,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:42:56,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:42:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:42:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:42:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:42:59,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:42:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:43:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:43:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:43:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:43:02,636][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:43:03,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:43:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:43:04,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:43:05,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:43:05,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:43:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:43:07,242][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:43:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:43:08,556][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:43:09,213][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:43:09,870][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:43:10,527][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:43:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:43:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:43:12,500][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:43:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:43:13,816][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:43:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:43:15,132][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:43:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:43:16,447][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:43:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:43:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:43:18,421][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:43:19,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:43:20,046][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:43:20,704][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:43:21,361][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:43:22,019][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:43:22,677][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:43:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:43:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:43:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:43:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:43:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:43:26,624][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:43:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:43:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:43:28,594][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:43:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:43:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:43:30,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:43:31,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:43:32,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:43:32,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:43:32,363][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:43:33,556][__main__][INFO] - Iteration 496 took 51s (9.62% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 0m 37s. Estimated total time: 14h 22m 47s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 23s. [2026-03-25 21:43:33,558][__main__][INFO] - Starting iteration 496. [2026-03-25 21:43:33,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:43:33,563][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:43:38,466][__main__][INFO] - Number of regex retries in iteration 496: 0 [2026-03-25 21:43:38,467][__main__][INFO] - agents played in iteration 496 are Alice, Bob [2026-03-25 21:43:39,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:43:39,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:43:39,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:43:39,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:43:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:43:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:43:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:43:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:43:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:43:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:43:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:43:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:43:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:43:45,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:43:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:43:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:43:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:43:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:43:49,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:43:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:43:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:43:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:43:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:43:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:43:53,029][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:43:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:43:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:43:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:43:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:43:56,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:43:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:43:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:43:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:43:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:43:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:44:00,268][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:44:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:44:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:44:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:44:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:44:03,557][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:44:04,214][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:44:04,871][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:44:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:44:06,184][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:44:06,840][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:44:07,497][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:44:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:44:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:44:09,470][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:44:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:44:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:44:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:44:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:44:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:44:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:44:14,418][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:44:15,076][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:44:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:44:16,392][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:44:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:44:17,710][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:44:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:44:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:44:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:44:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:44:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:44:21,658][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:44:22,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:44:23,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:44:24,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:44:24,361][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:44:24,364][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:44:25,637][__main__][INFO] - Iteration 497 took 52s (9.42% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 4m 55s. Estimated total time: 14h 27m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 58s. [2026-03-25 21:44:25,640][__main__][INFO] - Starting iteration 497. [2026-03-25 21:44:25,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:44:25,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:44:30,650][__main__][INFO] - Number of regex retries in iteration 497: 0 [2026-03-25 21:44:30,651][__main__][INFO] - agents played in iteration 497 are Alice, Bob [2026-03-25 21:44:31,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:44:31,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:44:31,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:44:31,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:44:32,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:44:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:44:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:44:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:44:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:44:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:44:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:44:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:44:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:44:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:44:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:44:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:44:39,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:44:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:44:41,223][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:44:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:44:42,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:44:43,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:44:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:44:44,512][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:44:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:44:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:44:46,482][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:44:47,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:44:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:44:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:44:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:44:49,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:44:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:44:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:44:51,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:44:52,399][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:44:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:44:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:44:54,372][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:44:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:44:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:44:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:44:57,005][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:44:57,663][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:44:58,322][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:44:58,980][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:44:59,638][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:45:00,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:45:00,953][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:45:01,612][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:45:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:45:02,927][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:45:03,916][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:45:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:45:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:45:05,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:45:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:45:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:45:07,877][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:45:08,534][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:45:09,192][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:45:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:45:10,507][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:45:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:45:11,822][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:45:12,480][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:45:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:45:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:45:14,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:45:15,203][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:45:16,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:45:16,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:45:16,285][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:45:17,676][__main__][INFO] - Iteration 498 took 52s (9.62% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 3m 20s. Estimated total time: 14h 27m 14s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 37s. [2026-03-25 21:45:17,679][__main__][INFO] - Starting iteration 498. [2026-03-25 21:45:17,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:45:17,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:45:22,568][__main__][INFO] - Number of regex retries in iteration 498: 0 [2026-03-25 21:45:22,569][__main__][INFO] - agents played in iteration 498 are Alice, Bob [2026-03-25 21:45:23,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:45:23,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:45:23,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:45:23,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:45:24,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:45:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:45:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:45:26,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:45:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:45:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:45:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:45:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:45:29,300][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:45:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:45:30,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:45:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:45:31,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:45:32,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:45:33,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:45:33,914][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:45:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:45:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:45:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:45:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:45:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:45:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:45:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:45:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:45:39,837][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:45:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:45:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:45:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:45:42,473][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:45:43,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:45:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:45:44,450][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:45:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:45:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:45:46,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:45:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:45:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:45:48,397][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:45:49,054][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:45:49,711][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:45:50,367][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:45:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:45:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:45:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:45:52,997][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:45:53,655][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:45:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:45:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:45:55,973][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:45:56,632][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:45:57,289][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:45:57,948][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:45:58,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:45:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:45:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:46:00,584][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:46:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:46:01,900][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:46:02,559][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:46:03,217][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:46:03,875][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:46:04,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:46:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:46:05,848][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:46:06,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:46:07,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:46:08,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:46:08,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:46:08,480][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:46:09,774][__main__][INFO] - Iteration 499 took 52s (9.38% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 3m 27s. Estimated total time: 14h 28m 13s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 6s. [2026-03-25 21:46:09,777][__main__][INFO] - Starting iteration 499. [2026-03-25 21:46:09,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:46:09,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:46:14,610][__main__][INFO] - Number of regex retries in iteration 499: 0 [2026-03-25 21:46:14,611][__main__][INFO] - agents played in iteration 499 are Alice, Bob [2026-03-25 21:46:15,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:46:15,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:46:15,229][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:46:15,230][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:46:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:46:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:46:17,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:46:17,979][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:46:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:46:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:46:19,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:46:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:46:21,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:46:21,940][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:46:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:46:23,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:46:23,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:46:24,569][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:46:25,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:46:25,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:46:26,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:46:27,197][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:46:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:46:28,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:46:29,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:46:29,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:46:30,480][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:46:31,137][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:46:31,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:46:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:46:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:46:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:46:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:46:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:46:35,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:46:36,391][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:46:37,049][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:46:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:46:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:46:39,020][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:46:39,677][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:46:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:46:40,990][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:46:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:46:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:46:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:46:43,616][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:46:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:46:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:46:45,586][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:46:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:46:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:46:47,895][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:46:48,552][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:46:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:46:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:46:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:46:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:46:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:46:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:46:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:46:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:46:54,465][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:46:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:46:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:46:56,437][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:46:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:46:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:46:58,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:46:59,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:47:00,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:47:00,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:47:00,315][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:47:01,595][__main__][INFO] - Iteration 500 took 51s (9.32% Gen, 88.21% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 57m 57s. Estimated total time: 14h 23m 35s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 47s. [2026-03-25 21:47:01,597][__main__][INFO] - Starting iteration 500. [2026-03-25 21:47:01,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:47:01,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:47:06,340][__main__][INFO] - Number of regex retries in iteration 500: 0 [2026-03-25 21:47:06,341][__main__][INFO] - agents played in iteration 500 are Alice, Bob [2026-03-25 21:47:06,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:47:07,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:47:07,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:47:07,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:47:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:47:08,491][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:47:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:47:09,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:47:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:47:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:47:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:47:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:47:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:47:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:47:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:47:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:47:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:47:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:47:17,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:47:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:47:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:47:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:47:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:47:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:47:20,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:47:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:47:22,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:47:22,954][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:47:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:47:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:47:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:47:25,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:47:26,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:47:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:47:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:47:28,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:47:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:47:29,523][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:47:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:47:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:47:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:47:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:47:32,808][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:47:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:47:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:47:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:47:35,435][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:47:36,092][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:47:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:47:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:47:38,063][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:47:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:47:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:47:40,360][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:47:41,018][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:47:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:47:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:47:42,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:47:43,646][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:47:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:47:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:47:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:47:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:47:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:47:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:47:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:47:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:47:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:47:50,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:47:50,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:47:52,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:47:52,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:47:52,165][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:47:55,049][__main__][INFO] - Iteration 501 took 53s (8.87% Gen, 85.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 24m 18s. Estimated total time: 14h 50m 49s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 24s. [2026-03-25 21:47:55,052][__main__][INFO] - Starting iteration 501. [2026-03-25 21:47:55,056][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:47:55,057][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:48:01,179][__main__][INFO] - Number of regex retries in iteration 501: 0 [2026-03-25 21:48:01,180][__main__][INFO] - agents played in iteration 501 are Alice, Bob [2026-03-25 21:48:01,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:01,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:01,769][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:48:01,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:48:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:48:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:48:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:48:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:48:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:48:05,978][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:48:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:48:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:48:07,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:48:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:48:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:48:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:48:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:48:11,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:48:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:48:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:48:13,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:48:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:48:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:48:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:48:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:48:16,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:48:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:48:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:48:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:48:19,270][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:48:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:48:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:48:21,255][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:48:21,918][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:48:22,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:48:23,237][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:48:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:48:24,565][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:48:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:48:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:48:26,550][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:48:27,213][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:48:27,873][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:48:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:48:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:48:29,848][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:48:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:48:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:48:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:48:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:48:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:48:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:48:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:48:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:48:36,215][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:48:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:48:37,533][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:48:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:48:38,850][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:48:39,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:48:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:48:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:48:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:48:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:48:42,801][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:48:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:48:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:48:44,776][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:48:45,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:48:46,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:48:47,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:48:47,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:48:47,585][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:48:49,011][__main__][INFO] - Iteration 502 took 53s (11.35% Gen, 86.00% Train). Generation: 6s, Training: 46s. Estimated remaining time: 7h 31m 52s. Estimated total time: 14h 59m 17s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 55s, 500 more iterations: 7h 29m 38s. [2026-03-25 21:48:49,014][__main__][INFO] - Starting iteration 502. [2026-03-25 21:48:49,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:48:49,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:48:54,077][__main__][INFO] - Number of regex retries in iteration 502: 0 [2026-03-25 21:48:54,078][__main__][INFO] - agents played in iteration 502 are Alice, Bob [2026-03-25 21:48:54,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:54,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:54,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:48:54,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:48:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:48:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:48:56,733][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:48:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:48:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:48:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:48:59,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:49:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:49:00,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:49:01,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:49:01,985][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:49:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:49:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:49:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:49:04,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:49:05,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:49:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:49:06,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:49:07,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:49:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:49:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:49:09,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:49:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:49:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:49:11,178][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:49:11,835][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:49:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:49:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:49:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:49:14,462][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:49:15,119][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:49:15,775][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:49:16,432][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:49:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:49:17,745][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:49:18,402][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:49:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:49:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:49:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:49:21,029][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:49:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:49:22,342][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:49:22,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:49:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:49:24,312][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:49:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:49:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:49:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:49:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:49:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:49:28,552][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:49:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:49:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:49:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:49:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:49:31,836][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:49:32,493][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:49:33,150][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:49:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:49:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:49:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:49:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:49:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:49:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:49:37,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:49:38,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:49:39,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:49:39,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:49:39,793][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:49:41,194][__main__][INFO] - Iteration 503 took 52s (9.70% Gen, 87.61% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 1m 20s. Estimated total time: 14h 29m 37s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 48s. [2026-03-25 21:49:41,197][__main__][INFO] - Starting iteration 503. [2026-03-25 21:49:41,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:49:41,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:49:46,274][__main__][INFO] - Number of regex retries in iteration 503: 0 [2026-03-25 21:49:46,275][__main__][INFO] - agents played in iteration 503 are Alice, Bob [2026-03-25 21:49:46,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:49:46,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:49:46,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:49:46,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:49:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:49:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:49:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:49:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:49:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:49:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:49:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:49:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:49:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:49:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:49:54,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:49:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:49:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:49:56,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:49:56,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:49:57,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:49:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:49:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:49:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:50:00,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:50:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:50:01,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:50:02,217][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:50:02,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:50:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:50:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:50:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:50:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:50:06,158][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:50:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:50:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:50:08,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:50:08,786][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:50:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:50:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:50:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:50:11,412][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:50:12,069][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:50:12,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:50:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:50:14,039][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:50:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:50:15,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:50:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:50:16,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:50:17,323][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:50:17,979][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:50:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:50:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:50:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:50:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:50:21,604][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:50:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:50:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:50:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:50:24,232][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:50:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:50:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:50:26,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:50:26,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:50:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:50:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:50:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:50:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:50:30,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:50:30,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:50:32,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:50:32,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:50:32,092][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:50:33,609][__main__][INFO] - Iteration 504 took 52s (9.68% Gen, 87.42% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 4m 20s. Estimated total time: 14h 33m 29s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 44s. [2026-03-25 21:50:33,612][__main__][INFO] - Starting iteration 504. [2026-03-25 21:50:33,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:50:33,617][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:50:38,441][__main__][INFO] - Number of regex retries in iteration 504: 0 [2026-03-25 21:50:38,442][__main__][INFO] - agents played in iteration 504 are Alice, Bob [2026-03-25 21:50:39,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:50:39,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:50:39,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:50:39,111][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:50:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:50:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:50:41,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:50:41,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:50:42,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:50:43,084][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:50:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:50:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:50:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:50:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:50:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:50:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:50:47,680][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:50:48,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:50:48,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:50:49,650][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:50:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:50:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:50:51,620][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:50:52,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:50:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:50:53,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:50:54,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:50:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:50:55,560][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:50:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:50:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:50:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:50:58,188][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:50:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:50:59,502][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:51:00,158][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:51:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:51:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:51:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:51:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:51:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:51:04,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:51:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:51:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:51:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:51:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:51:07,384][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:51:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:51:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:51:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:51:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:51:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:51:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:51:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:51:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:51:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:51:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:51:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:51:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:51:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:51:16,883][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:51:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:51:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:51:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:51:19,509][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:51:20,166][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:51:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:51:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:51:22,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:51:22,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:51:23,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:51:23,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:51:23,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:51:25,220][__main__][INFO] - Iteration 505 took 51s (9.35% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 50m 4s. Estimated total time: 14h 20m 5s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 0s, 500 more iterations: 7h 10m 2s. [2026-03-25 21:51:25,223][__main__][INFO] - Starting iteration 505. [2026-03-25 21:51:25,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:51:25,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:51:29,983][__main__][INFO] - Number of regex retries in iteration 505: 0 [2026-03-25 21:51:29,985][__main__][INFO] - agents played in iteration 505 are Alice, Bob [2026-03-25 21:51:30,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:51:30,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:51:30,599][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:51:30,600][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:51:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:51:31,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:51:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:51:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:51:33,865][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:51:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:51:35,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:51:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:51:36,492][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:51:37,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:51:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:51:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:51:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:51:39,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:51:40,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:51:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:51:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:51:42,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:51:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:51:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:51:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:51:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:51:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:51:46,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:51:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:51:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:51:48,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:51:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:51:49,625][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:51:50,281][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:51:50,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:51:51,594][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:51:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:51:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:51:53,564][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:51:54,220][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:51:54,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:51:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:51:56,192][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:51:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:51:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:51:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:51:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:51:59,514][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:52:00,177][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:52:00,835][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:52:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:52:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:52:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:52:03,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:52:04,485][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:52:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:52:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:52:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:52:07,114][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:52:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:52:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:52:09,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:52:09,741][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:52:10,398][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:52:11,055][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:52:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:52:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:52:13,026][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:52:13,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:52:14,508][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:52:15,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:52:15,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:52:15,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:52:17,172][__main__][INFO] - Iteration 506 took 51s (9.16% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 54m 54s. Estimated total time: 14h 25m 47s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 53s. [2026-03-25 21:52:17,175][__main__][INFO] - Starting iteration 506. [2026-03-25 21:52:17,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:52:17,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:52:22,372][__main__][INFO] - Number of regex retries in iteration 506: 0 [2026-03-25 21:52:22,374][__main__][INFO] - agents played in iteration 506 are Alice, Bob [2026-03-25 21:52:23,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:23,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:23,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:52:23,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:52:23,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:52:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:52:25,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:52:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:52:26,563][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:52:27,219][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:52:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:52:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:52:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:52:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:52:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:52:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:52:31,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:52:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:52:33,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:52:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:52:34,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:52:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:52:35,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:52:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:52:37,069][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:52:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:52:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:52:39,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:52:39,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:52:40,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:52:41,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:52:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:52:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:52:42,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:52:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:52:44,292][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:52:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:52:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:52:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:52:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:52:47,577][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:52:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:52:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:52:49,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:52:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:52:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:52:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:52:52,174][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:52:52,830][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:52:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:52:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:52:54,800][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:52:55,789][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:52:56,447][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:52:57,104][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:52:57,760][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:52:58,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:52:59,074][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:52:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:53:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:53:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:53:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:53:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:53:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:53:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:53:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:53:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:53:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:53:06,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:53:07,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:53:08,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:53:08,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:53:08,247][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:53:09,569][__main__][INFO] - Iteration 507 took 52s (9.91% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 1m 26s. Estimated total time: 14h 33m 12s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 36s. [2026-03-25 21:53:09,572][__main__][INFO] - Starting iteration 507. [2026-03-25 21:53:09,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:53:09,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:53:14,533][__main__][INFO] - Number of regex retries in iteration 507: 0 [2026-03-25 21:53:14,534][__main__][INFO] - agents played in iteration 507 are Alice, Bob [2026-03-25 21:53:15,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:53:15,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:53:15,315][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:53:15,315][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:53:15,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:53:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:53:17,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:53:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:53:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:53:19,251][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:53:19,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:53:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:53:21,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:53:21,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:53:22,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:53:23,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:53:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:53:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:53:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:53:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:53:26,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:53:27,144][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:53:27,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:53:28,463][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:53:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:53:29,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:53:30,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:53:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:53:31,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:53:32,434][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:53:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:53:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:53:34,413][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:53:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:53:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:53:36,384][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:53:37,041][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:53:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:53:38,355][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:53:39,011][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:53:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:53:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:53:40,982][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:53:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:53:42,295][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:53:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:53:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:53:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:53:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:53:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:53:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:53:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:53:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:53:48,481][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:53:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:53:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:53:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:53:51,107][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:53:51,764][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:53:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:53:53,078][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:53:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:53:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:53:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:53:55,707][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:53:56,364][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:53:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:53:57,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:53:58,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:53:59,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:54:00,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:54:00,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:54:00,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:54:01,290][__main__][INFO] - Iteration 508 took 51s (9.59% Gen, 88.15% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 49m 18s. Estimated total time: 14h 21m 55s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 57s. [2026-03-25 21:54:01,293][__main__][INFO] - Starting iteration 508. [2026-03-25 21:54:01,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:54:01,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:54:06,065][__main__][INFO] - Number of regex retries in iteration 508: 0 [2026-03-25 21:54:06,066][__main__][INFO] - agents played in iteration 508 are Alice, Bob [2026-03-25 21:54:06,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:06,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:06,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:54:06,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:54:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:54:08,009][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:54:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:54:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:54:09,978][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:54:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:54:11,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:54:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:54:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:54:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:54:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:54:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:54:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:54:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:54:16,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:54:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:54:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:54:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:54:19,174][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:54:19,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:54:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:54:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:54:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:54:22,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:54:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:54:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:54:24,429][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:54:25,085][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:54:25,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:54:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:54:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:54:27,713][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:54:28,370][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:54:29,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:54:29,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:54:30,340][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:54:30,996][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:54:31,653][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:54:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:54:32,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:54:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:54:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:54:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:54:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:54:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:54:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:54:37,563][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:54:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:54:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:54:39,862][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:54:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:54:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:54:41,833][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:54:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:54:43,147][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:54:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:54:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:54:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:54:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:54:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:54:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:54:47,744][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:54:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:54:49,057][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:54:49,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:54:50,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:54:51,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:54:51,585][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:54:51,586][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:54:52,826][__main__][INFO] - Iteration 509 took 51s (9.25% Gen, 88.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 45m 21s. Estimated total time: 14h 18m 50s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 53s, 500 more iterations: 7h 9m 25s. [2026-03-25 21:54:52,829][__main__][INFO] - Starting iteration 509. [2026-03-25 21:54:52,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:54:52,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:54:57,874][__main__][INFO] - Number of regex retries in iteration 509: 0 [2026-03-25 21:54:57,876][__main__][INFO] - agents played in iteration 509 are Alice, Bob [2026-03-25 21:54:58,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:58,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:58,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:54:58,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:54:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:55:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:55:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:55:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:55:02,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:55:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:55:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:55:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:55:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:55:05,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:55:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:55:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:55:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:55:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:55:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:55:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:55:09,895][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:55:10,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:55:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:55:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:55:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:55:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:55:13,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:55:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:55:15,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:55:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:55:16,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:55:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:55:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:55:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:55:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:55:19,758][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:55:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:55:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:55:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:55:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:55:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:55:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:55:24,354][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:55:25,011][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:55:25,667][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:55:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:55:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:55:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:55:28,294][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:55:28,951][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:55:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:55:30,264][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:55:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:55:31,912][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:55:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:55:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:55:33,884][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:55:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:55:35,197][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:55:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:55:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:55:37,169][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:55:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:55:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:55:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:55:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:55:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:55:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:55:41,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:55:42,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:55:43,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:55:43,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:55:43,773][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:55:44,981][__main__][INFO] - Iteration 510 took 52s (9.67% Gen, 88.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 54m 48s. Estimated total time: 14h 29m 9s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 34s. [2026-03-25 21:55:44,983][__main__][INFO] - Starting iteration 510. [2026-03-25 21:55:44,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:55:44,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:55:49,952][__main__][INFO] - Number of regex retries in iteration 510: 0 [2026-03-25 21:55:49,953][__main__][INFO] - agents played in iteration 510 are Alice, Bob [2026-03-25 21:55:50,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:55:50,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:55:50,609][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:55:50,609][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:55:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:55:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:55:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:55:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:55:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:55:54,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:55:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:55:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:55:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:55:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:55:57,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:55:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:55:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:55:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:56:00,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:56:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:56:01,755][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:56:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:56:03,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:56:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:56:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:56:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:56:05,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:56:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:56:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:56:07,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:56:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:56:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:56:09,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:56:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:56:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:56:11,607][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:56:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:56:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:56:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:56:14,235][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:56:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:56:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:56:16,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:56:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:56:17,519][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:56:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:56:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:56:19,489][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:56:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:56:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:56:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:56:22,116][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:56:23,109][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:56:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:56:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:56:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:56:25,737][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:56:26,395][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:56:27,051][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:56:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:56:28,365][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:56:29,021][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:56:29,678][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:56:30,335][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:56:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:56:31,648][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:56:32,305][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:56:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:56:33,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:56:34,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:56:39,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:56:39,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:56:39,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:56:40,398][__main__][INFO] - Iteration 511 took 55s (8.96% Gen, 88.62% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 48m 16s. Estimated total time: 15h 23m 32s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 46s. [2026-03-25 21:56:40,401][__main__][INFO] - Starting iteration 511. [2026-03-25 21:56:40,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:56:40,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:56:47,679][__main__][INFO] - Number of regex retries in iteration 511: 0 [2026-03-25 21:56:47,681][__main__][INFO] - agents played in iteration 511 are Alice, Bob [2026-03-25 21:56:48,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:56:48,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:56:48,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:56:48,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:56:49,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:56:49,945][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:56:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:56:51,250][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:56:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:56:52,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:56:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:56:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:56:54,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:56:55,186][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:56:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:56:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:56:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:56:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:56:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:56:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:56:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:57:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:57:01,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:57:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:57:02,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:57:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:57:03,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:57:04,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:57:05,041][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:57:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:57:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:57:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:57:07,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:57:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:57:09,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:57:09,775][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:57:10,432][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:57:11,089][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:57:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:57:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:57:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:57:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:57:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:57:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:57:15,687][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:57:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:57:17,000][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:57:17,657][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:57:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:57:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:57:19,628][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:57:20,284][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:57:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:57:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:57:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:57:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:57:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:57:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:57:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:57:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:57:26,509][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:57:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:57:27,822][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:57:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:57:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:57:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:57:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:57:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:57:31,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:57:32,501][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:57:33,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:57:33,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:57:33,713][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:57:35,030][__main__][INFO] - Iteration 512 took 54s (13.32% Gen, 84.27% Train). Generation: 7s, Training: 46s. Estimated remaining time: 7h 34m 14s. Estimated total time: 15h 10m 25s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 2s, 500 more iterations: 7h 35m 12s. [2026-03-25 21:57:35,033][__main__][INFO] - Starting iteration 512. [2026-03-25 21:57:35,036][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:57:35,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:57:39,985][__main__][INFO] - Number of regex retries in iteration 512: 0 [2026-03-25 21:57:39,986][__main__][INFO] - agents played in iteration 512 are Alice, Bob [2026-03-25 21:57:40,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:57:40,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:57:40,725][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:57:40,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:57:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:57:42,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:57:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:57:43,352][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:57:44,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:57:44,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:57:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:57:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:57:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:57:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:57:47,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:57:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:57:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:57:49,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:57:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:57:51,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:57:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:57:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:57:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:57:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:57:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:57:55,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:57:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:57:56,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:57:57,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:57:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:57:58,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:57:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:57:59,765][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:58:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:58:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:58:01,734][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:58:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:58:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:58:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:58:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:58:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:58:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:58:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:58:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:58:07,642][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:58:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:58:08,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:58:09,611][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:58:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:58:10,924][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:58:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:58:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:58:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:58:13,798][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:58:14,454][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:58:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:58:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:58:16,424][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:58:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:58:17,737][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:58:18,393][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:58:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:58:19,706][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:58:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:58:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:58:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:58:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:58:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:58:23,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:58:24,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 21:58:25,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:58:25,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:58:25,529][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:58:27,054][__main__][INFO] - Iteration 513 took 52s (9.51% Gen, 87.61% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 49m 56s. Estimated total time: 14h 26m 59s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 29s. [2026-03-25 21:58:27,067][__main__][INFO] - Starting iteration 513. [2026-03-25 21:58:27,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:58:27,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:58:31,948][__main__][INFO] - Number of regex retries in iteration 513: 0 [2026-03-25 21:58:31,950][__main__][INFO] - agents played in iteration 513 are Alice, Bob [2026-03-25 21:58:32,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:58:32,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:58:32,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:58:32,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:58:33,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:58:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:58:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:58:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:58:36,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:58:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:58:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:58:38,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:58:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:58:39,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:58:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:58:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:58:41,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:58:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:58:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:58:43,373][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:58:44,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:58:44,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:58:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:58:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:58:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:58:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:58:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:58:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:58:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:58:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:58:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:58:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:58:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:58:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:58:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:58:53,875][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:58:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:58:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:58:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:58:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:58:57,159][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:58:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:58:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:58:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:58:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:59:00,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:59:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:59:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:59:02,410][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:59:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:59:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:59:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:59:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:59:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:59:06,641][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:59:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:59:07,954][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:59:08,612][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:59:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:59:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:59:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:59:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:59:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:59:12,553][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:59:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:59:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:59:14,523][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:59:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:59:15,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:59:16,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:59:17,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:59:17,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:59:17,682][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:59:18,936][__main__][INFO] - Iteration 514 took 51s (9.39% Gen, 88.18% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 46m 25s. Estimated total time: 14h 24m 20s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 26s, 500 more iterations: 7h 12m 10s. [2026-03-25 21:59:18,939][__main__][INFO] - Starting iteration 514. [2026-03-25 21:59:18,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:59:18,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:59:23,826][__main__][INFO] - Number of regex retries in iteration 514: 0 [2026-03-25 21:59:23,827][__main__][INFO] - agents played in iteration 514 are Alice, Bob [2026-03-25 21:59:24,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:59:24,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:59:24,675][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:59:24,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:59:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:59:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:59:26,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:59:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:59:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:59:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:59:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:59:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:59:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:59:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:59:31,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:59:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:59:33,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:59:33,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:59:34,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:59:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:59:35,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:59:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:59:37,124][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:59:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:59:38,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:59:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:59:39,751][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:59:40,408][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:59:41,065][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:59:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:59:42,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:59:43,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:59:43,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:59:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:59:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:59:45,663][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:59:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:59:46,977][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:59:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:59:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:59:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:59:49,605][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:59:50,262][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:59:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:59:51,575][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:59:52,233][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:59:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:59:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:59:54,204][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:59:54,861][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:59:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:59:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:59:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:59:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:59:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:59:59,049][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:59:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:00:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:00:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:00:01,677][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:00:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:00:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:00:03,647][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:00:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:00:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:00:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:00:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:00:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:00:07,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:00:08,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:00:09,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:00:09,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:00:09,457][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:00:10,848][__main__][INFO] - Iteration 515 took 51s (9.41% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 46m 20s. Estimated total time: 14h 25m 6s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 33s. [2026-03-25 22:00:10,851][__main__][INFO] - Starting iteration 515. [2026-03-25 22:00:10,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:00:10,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:00:15,863][__main__][INFO] - Number of regex retries in iteration 515: 0 [2026-03-25 22:00:15,864][__main__][INFO] - agents played in iteration 515 are Alice, Bob [2026-03-25 22:00:16,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:16,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:16,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:00:16,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:00:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:00:18,052][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:00:18,709][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:00:19,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:00:20,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:00:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:00:21,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:00:21,991][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:00:22,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:00:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:00:23,960][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:00:24,617][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:00:25,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:00:25,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:00:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:00:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:00:27,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:00:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:00:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:00:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:00:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:00:31,182][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:00:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:00:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:00:33,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:00:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:00:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:00:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:00:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:00:36,435][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:00:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:00:37,749][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:00:38,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:00:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:00:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:00:40,374][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:00:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:00:41,687][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:00:42,343][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:00:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:00:43,656][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:00:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:00:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:00:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:00:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:00:46,938][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:00:47,595][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:00:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:00:49,212][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:00:49,869][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:00:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:00:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:00:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:00:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:00:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:00:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:00:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:00:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:00:55,782][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:00:56,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:00:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:00:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:00:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:00:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:00:59,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:01:00,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:01:01,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:01:01,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:01:01,636][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:01:02,904][__main__][INFO] - Iteration 516 took 52s (9.62% Gen, 87.94% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 47m 52s. Estimated total time: 14h 27m 31s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 45s. [2026-03-25 22:01:02,906][__main__][INFO] - Starting iteration 516. [2026-03-25 22:01:02,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:01:02,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:01:07,986][__main__][INFO] - Number of regex retries in iteration 516: 0 [2026-03-25 22:01:07,988][__main__][INFO] - agents played in iteration 516 are Alice, Bob [2026-03-25 22:01:08,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:01:08,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:01:08,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:01:08,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:01:09,275][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:01:09,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:01:10,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:01:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:01:11,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:01:12,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:01:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:01:13,831][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:01:14,488][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:01:15,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:01:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:01:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:01:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:01:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:01:18,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:01:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:01:19,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:01:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:01:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:01:21,709][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:01:22,365][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:01:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:01:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:01:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:01:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:01:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:01:26,307][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:01:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:01:27,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:01:28,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:01:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:01:29,590][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:01:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:01:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:01:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:01:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:01:32,874][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:01:33,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:01:34,188][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:01:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:01:35,501][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:01:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:01:36,815][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:01:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:01:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:01:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:01:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:01:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:01:40,999][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:01:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:01:42,312][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:01:42,969][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:01:43,625][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:01:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:01:44,938][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:01:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:01:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:01:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:01:47,564][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:01:48,221][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:01:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:01:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:01:50,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:01:50,847][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:01:51,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:01:52,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:01:53,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:01:53,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:01:53,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:01:54,762][__main__][INFO] - Iteration 517 took 51s (9.79% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 43m 42s. Estimated total time: 14h 24m 13s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 6s. [2026-03-25 22:01:54,765][__main__][INFO] - Starting iteration 517. [2026-03-25 22:01:54,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:01:54,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:02:00,036][__main__][INFO] - Number of regex retries in iteration 517: 0 [2026-03-25 22:02:00,037][__main__][INFO] - agents played in iteration 517 are Alice, Bob [2026-03-25 22:02:00,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:00,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:00,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:02:00,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:02:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:02:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:02:02,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:02:03,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:02:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:02:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:02:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:02:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:02:06,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:02:07,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:02:08,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:02:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:02:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:02:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:02:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:02:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:02:12,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:02:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:02:13,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:02:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:02:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:02:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:02:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:02:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:02:17,270][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:02:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:02:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:02:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:02:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:02:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:02:21,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:02:21,867][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:02:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:02:23,180][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:02:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:02:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:02:25,150][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:02:25,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:02:26,465][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:02:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:02:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:02:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:02:29,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:02:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:02:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:02:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:02:31,719][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:02:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:02:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:02:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:02:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:02:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:02:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:02:36,582][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:02:37,239][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:02:37,896][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:02:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:02:39,209][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:02:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:02:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:02:41,179][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:02:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:02:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:02:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:02:43,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:02:44,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:02:45,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:02:45,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:02:45,671][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:02:46,848][__main__][INFO] - Iteration 518 took 52s (10.12% Gen, 87.62% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 46m 38s. Estimated total time: 14h 28m 0s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 0s. [2026-03-25 22:02:46,850][__main__][INFO] - Starting iteration 518. [2026-03-25 22:02:46,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:02:46,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:02:57,239][__main__][INFO] - Number of regex retries in iteration 518: 0 [2026-03-25 22:02:57,240][__main__][INFO] - agents played in iteration 518 are Alice, Bob [2026-03-25 22:02:58,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:58,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:58,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:02:58,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:02:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:02:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:03:00,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:03:00,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:03:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:03:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:03:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:03:03,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:03:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:03:04,683][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:03:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:03:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:03:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:03:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:03:07,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:03:08,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:03:09,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:03:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:03:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:03:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:03:11,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:03:12,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:03:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:03:13,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:03:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:03:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:03:15,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:03:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:03:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:03:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:03:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:03:19,126][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:03:19,782][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:03:20,439][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:03:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:03:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:03:22,408][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:03:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:03:23,721][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:03:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:03:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:03:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:03:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:03:27,006][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:03:27,663][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:03:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:03:28,975][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:03:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:03:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:03:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:03:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:03:32,494][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:03:33,150][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:03:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:03:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:03:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:03:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:03:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:03:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:03:37,747][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:03:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:03:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:03:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:03:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:03:41,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:03:41,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:03:42,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:03:42,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:03:42,936][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:03:44,236][__main__][INFO] - Iteration 519 took 57s (18.10% Gen, 79.63% Train). Generation: 10s, Training: 45s. Estimated remaining time: 8h 14m 3s. Estimated total time: 15h 56m 23s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 38s, 500 more iterations: 7h 58m 11s. [2026-03-25 22:03:44,238][__main__][INFO] - Starting iteration 519. [2026-03-25 22:03:44,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:03:44,242][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:03:49,632][__main__][INFO] - Number of regex retries in iteration 519: 0 [2026-03-25 22:03:49,634][__main__][INFO] - agents played in iteration 519 are Alice, Bob [2026-03-25 22:03:50,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:03:50,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:03:50,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:03:50,514][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:03:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:03:51,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:03:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:03:53,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:03:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:03:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:03:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:03:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:03:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:03:57,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:03:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:03:58,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:03:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:03:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:04:00,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:04:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:04:01,637][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:04:02,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:04:02,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:04:03,607][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:04:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:04:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:04:05,577][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:04:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:04:06,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:04:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:04:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:04:08,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:04:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:04:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:04:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:04:11,486][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:04:12,142][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:04:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:04:13,455][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:04:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:04:14,769][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:04:15,424][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:04:16,081][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:04:16,738][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:04:17,394][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:04:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:04:18,708][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:04:19,364][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:04:20,020][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:04:20,677][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:04:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:04:21,990][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:04:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:04:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:04:24,232][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:04:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:04:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:04:26,203][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:04:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:04:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:04:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:04:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:04:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:04:30,143][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:04:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:04:31,456][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:04:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:04:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:04:33,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:04:34,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:04:35,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:04:35,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:04:35,425][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:04:36,681][__main__][INFO] - Iteration 520 took 52s (10.28% Gen, 87.32% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 50m 48s. Estimated total time: 14h 34m 1s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 24s, 500 more iterations: 7h 17m 0s. [2026-03-25 22:04:36,684][__main__][INFO] - Starting iteration 520. [2026-03-25 22:04:36,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:04:36,689][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:04:41,784][__main__][INFO] - Number of regex retries in iteration 520: 0 [2026-03-25 22:04:41,786][__main__][INFO] - agents played in iteration 520 are Alice, Bob [2026-03-25 22:04:42,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:04:42,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:04:42,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:04:42,738][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:04:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:04:44,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:04:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:04:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:04:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:04:46,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:04:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:04:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:04:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:04:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:04:49,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:04:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:04:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:04:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:04:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:04:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:04:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:04:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:04:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:04:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:04:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:04:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:04:57,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:04:58,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:04:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:04:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:05:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:05:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:05:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:05:02,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:05:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:05:03,724][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:05:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:05:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:05:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:05:06,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:05:07,006][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:05:07,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:05:08,319][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:05:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:05:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:05:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:05:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:05:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:05:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:05:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:05:13,572][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:05:14,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:05:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:05:15,837][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:05:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:05:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:05:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:05:18,466][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:05:19,123][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:05:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:05:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:05:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:05:21,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:05:22,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:05:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:05:23,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:05:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:05:25,036][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:05:25,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:05:26,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:05:27,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:05:27,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:05:27,568][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:05:28,890][__main__][INFO] - Iteration 521 took 52s (9.76% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 45m 58s. Estimated total time: 14h 30m 3s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 1s. [2026-03-25 22:05:28,893][__main__][INFO] - Starting iteration 521. [2026-03-25 22:05:28,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:05:28,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:05:33,738][__main__][INFO] - Number of regex retries in iteration 521: 0 [2026-03-25 22:05:33,739][__main__][INFO] - agents played in iteration 521 are Alice, Bob [2026-03-25 22:05:34,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:05:34,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:05:34,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:05:34,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:05:35,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:05:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:05:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:05:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:05:37,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:05:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:05:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:05:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:05:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:05:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:05:41,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:05:42,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:05:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:05:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:05:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:05:44,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:05:45,611][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:05:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:05:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:05:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:05:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:05:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:05:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:05:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:05:50,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:05:51,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:05:52,180][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:05:52,837][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:05:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:05:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:05:54,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:05:55,464][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:05:56,123][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:05:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:05:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:05:58,094][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:05:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:05:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:06:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:06:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:06:01,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:06:02,036][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:06:02,693][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:06:03,350][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:06:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:06:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:06:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:06:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:06:06,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:06:07,545][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:06:08,202][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:06:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:06:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:06:10,172][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:06:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:06:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:06:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:06:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:06:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:06:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:06:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:06:15,428][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:06:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:06:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:06:17,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:06:18,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:06:19,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:06:19,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:06:19,349][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:06:22,730][__main__][INFO] - Iteration 522 took 53s (8.99% Gen, 84.72% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 12m 16s. Estimated total time: 14h 57m 15s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 43s, 500 more iterations: 7h 28m 37s. [2026-03-25 22:06:22,733][__main__][INFO] - Starting iteration 522. [2026-03-25 22:06:22,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:06:22,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:06:27,691][__main__][INFO] - Number of regex retries in iteration 522: 0 [2026-03-25 22:06:27,692][__main__][INFO] - agents played in iteration 522 are Alice, Bob [2026-03-25 22:06:28,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:06:28,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:06:28,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:06:28,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:06:29,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:06:29,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:06:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:06:31,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:06:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:06:32,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:06:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:06:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:06:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:06:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:06:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:06:36,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:06:37,077][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:06:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:06:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:06:39,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:06:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:06:40,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:06:41,016][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:06:41,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:06:42,329][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:06:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:06:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:06:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:06:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:06:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:06:46,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:06:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:06:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:06:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:06:48,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:06:49,553][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:06:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:06:50,866][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:06:51,523][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:06:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:06:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:06:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:06:54,149][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:06:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:06:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:06:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:06:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:06:57,434][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:06:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:06:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:06:59,403][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:07:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:07:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:07:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:07:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:07:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:07:03,653][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:07:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:07:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:07:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:07:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:07:06,936][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:07:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:07:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:07:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:07:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:07:10,219][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:07:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:07:11,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:07:12,258][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:07:13,707][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:07:13,710][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:07:13,711][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:07:15,074][__main__][INFO] - Iteration 523 took 52s (9.47% Gen, 87.93% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 46m 28s. Estimated total time: 14h 32m 19s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 13s, 500 more iterations: 7h 16m 9s. [2026-03-25 22:07:15,076][__main__][INFO] - Starting iteration 523. [2026-03-25 22:07:15,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:07:15,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:07:19,941][__main__][INFO] - Number of regex retries in iteration 523: 0 [2026-03-25 22:07:19,942][__main__][INFO] - agents played in iteration 523 are Alice, Bob [2026-03-25 22:07:20,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:07:20,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:07:20,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:07:20,771][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:07:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:07:22,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:07:22,717][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:07:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:07:24,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:07:24,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:07:25,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:07:26,000][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:07:26,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:07:27,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:07:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:07:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:07:29,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:07:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:07:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:07:31,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:07:31,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:07:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:07:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:07:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:07:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:07:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:07:35,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:07:36,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:07:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:07:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:07:38,478][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:07:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:07:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:07:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:07:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:07:41,762][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:07:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:07:43,075][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:07:43,732][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:07:44,389][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:07:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:07:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:07:46,359][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:07:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:07:47,673][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:07:48,329][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:07:48,986][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:07:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:07:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:07:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:07:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:07:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:07:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:07:53,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:07:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:07:55,137][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:07:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:07:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:07:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:07:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:07:58,423][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:07:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:07:59,737][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:08:00,394][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:08:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:08:01,708][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:08:02,365][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:08:03,022][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:08:03,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:08:04,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:08:05,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:08:05,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:08:05,562][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:08:06,908][__main__][INFO] - Iteration 524 took 51s (9.38% Gen, 88.02% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 37m 5s. Estimated total time: 14h 23m 48s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 22s, 500 more iterations: 7h 11m 54s. [2026-03-25 22:08:06,911][__main__][INFO] - Starting iteration 524. [2026-03-25 22:08:06,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:08:06,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:08:11,767][__main__][INFO] - Number of regex retries in iteration 524: 0 [2026-03-25 22:08:11,769][__main__][INFO] - agents played in iteration 524 are Alice, Bob [2026-03-25 22:08:12,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:12,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:12,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:08:12,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:08:13,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:08:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:08:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:08:15,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:08:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:08:16,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:08:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:08:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:08:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:08:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:08:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:08:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:08:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:08:21,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:08:22,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:08:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:08:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:08:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:08:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:08:25,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:08:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:08:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:08:27,639][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:08:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:08:28,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:08:29,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:08:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:08:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:08:31,578][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:08:32,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:08:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:08:33,548][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:08:34,204][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:08:34,861][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:08:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:08:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:08:36,831][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:08:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:08:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:08:38,800][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:08:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:08:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:08:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:08:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:08:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:08:42,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:08:43,395][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:08:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:08:45,018][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:08:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:08:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:08:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:08:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:08:48,301][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:08:48,958][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:08:49,614][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:08:50,271][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:08:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:08:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:08:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:08:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:08:53,554][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:08:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:08:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:08:55,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:08:56,232][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:08:57,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:08:57,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:08:57,421][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:08:58,625][__main__][INFO] - Iteration 525 took 51s (9.39% Gen, 88.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 34m 17s. Estimated total time: 14h 21m 51s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 55s. [2026-03-25 22:08:58,628][__main__][INFO] - Starting iteration 525. [2026-03-25 22:08:58,631][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:08:58,631][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:09:03,602][__main__][INFO] - Number of regex retries in iteration 525: 0 [2026-03-25 22:09:03,603][__main__][INFO] - agents played in iteration 525 are Alice, Bob [2026-03-25 22:09:04,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:09:04,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:09:04,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:09:04,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:09:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:09:05,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:09:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:09:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:09:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:09:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:09:09,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:09:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:09:10,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:09:11,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:09:11,717][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:09:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:09:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:09:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:09:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:09:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:09:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:09:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:09:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:09:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:09:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:09:19,603][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:09:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:09:24,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:09:25,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:09:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:09:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:09:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:09:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:09:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:09:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:09:29,850][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:09:30,507][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:09:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:09:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:09:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:09:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:09:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:09:34,447][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:09:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:09:35,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:09:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:09:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:09:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:09:38,385][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:09:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:09:39,698][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:09:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:09:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:09:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:09:42,571][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:09:43,228][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:09:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:09:44,542][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:09:45,200][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:09:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:09:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:09:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:09:47,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:09:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:09:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:09:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:09:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:09:51,112][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:09:51,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:09:52,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:47 [2026-03-25 22:09:53,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:09:53,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:09:53,807][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:09:54,975][__main__][INFO] - Iteration 526 took 56s (8.82% Gen, 89.10% Train). Generation: 4s, Training: 50s. Estimated remaining time: 7h 50m 34s. Estimated total time: 15h 39m 5s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 32s. [2026-03-25 22:09:54,977][__main__][INFO] - Starting iteration 526. [2026-03-25 22:09:54,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:09:54,982][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:10:03,048][__main__][INFO] - Number of regex retries in iteration 526: 0 [2026-03-25 22:10:03,050][__main__][INFO] - agents played in iteration 526 are Alice, Bob [2026-03-25 22:10:03,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:03,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:03,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:10:03,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:10:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:10:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:10:05,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:10:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:10:07,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:10:07,778][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:10:08,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:10:09,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:10:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:10:10,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:10:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:10:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:10:12,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:10:13,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:10:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:10:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:10:14,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:10:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:10:16,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:10:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:10:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:10:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:10:18,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:10:19,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:10:20,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:10:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:10:21,566][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:10:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:10:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:10:23,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:10:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:10:24,850][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:10:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:10:26,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:10:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:10:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:10:28,137][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:10:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:10:29,450][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:10:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:10:30,764][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:10:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:10:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:10:32,734][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:10:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:10:34,048][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:10:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:10:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:10:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:10:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:10:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:10:38,299][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:10:38,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:10:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:10:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:10:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:10:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:10:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:10:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:10:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:10:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:10:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:10:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:10:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:10:46,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:10:47,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:10:48,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:10:48,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:10:48,743][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:10:50,076][__main__][INFO] - Iteration 527 took 55s (14.64% Gen, 82.93% Train). Generation: 8s, Training: 45s. Estimated remaining time: 7h 28m 50s. Estimated total time: 15h 18m 16s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 8s. [2026-03-25 22:10:50,079][__main__][INFO] - Starting iteration 527. [2026-03-25 22:10:50,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:10:50,083][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:10:54,891][__main__][INFO] - Number of regex retries in iteration 527: 0 [2026-03-25 22:10:54,892][__main__][INFO] - agents played in iteration 527 are Alice, Bob [2026-03-25 22:10:55,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:55,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:55,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:10:55,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:10:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:10:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:10:57,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:10:58,264][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:10:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:10:59,576][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:11:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:11:00,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:11:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:11:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:11:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:11:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:11:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:11:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:11:05,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:11:06,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:11:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:11:07,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:11:08,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:11:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:11:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:11:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:11:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:11:11,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:11:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:11:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:11:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:11:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:11:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:11:15,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:11:15,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:11:16,647][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:11:17,304][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:11:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:11:18,617][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:11:19,274][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:11:19,930][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:11:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:11:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:11:21,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:11:22,557][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:11:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:11:23,871][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:11:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:11:25,184][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:11:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:11:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:11:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:11:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:11:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:11:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:11:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:11:30,674][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:11:31,335][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:11:31,988][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:11:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:11:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:11:33,959][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:11:34,616][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:11:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:11:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:11:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:11:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:11:37,901][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:11:38,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:11:39,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:11:40,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:11:40,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:11:40,387][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:11:41,871][__main__][INFO] - Iteration 528 took 51s (9.28% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 32m 51s. Estimated total time: 14h 23m 9s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 34s. [2026-03-25 22:11:41,874][__main__][INFO] - Starting iteration 528. [2026-03-25 22:11:41,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:11:41,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:11:46,737][__main__][INFO] - Number of regex retries in iteration 528: 0 [2026-03-25 22:11:46,738][__main__][INFO] - agents played in iteration 528 are Alice, Bob [2026-03-25 22:11:47,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:11:47,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:11:47,682][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:11:47,683][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:11:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:11:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:11:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:11:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:11:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:11:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:11:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:11:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:11:53,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:11:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:11:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:11:55,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:11:56,197][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:11:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:11:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:11:58,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:11:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:11:59,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:12:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:12:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:12:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:12:02,106][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:12:02,763][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:12:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:12:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:12:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:12:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:12:06,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:12:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:12:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:12:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:12:08,670][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:12:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:12:09,983][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:12:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:12:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:12:11,954][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:12:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:12:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:12:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:12:14,580][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:12:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:12:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:12:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:12:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:12:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:12:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:12:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:12:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:12:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:12:21,432][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:12:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:12:22,747][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:12:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:12:24,061][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:12:24,719][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:12:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:12:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:12:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:12:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:12:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:12:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:12:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:12:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:12:30,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:12:31,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:12:32,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:12:32,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:12:32,789][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:12:34,012][__main__][INFO] - Iteration 529 took 52s (9.32% Gen, 88.33% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 37m 45s. Estimated total time: 14h 28m 55s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 27s. [2026-03-25 22:12:34,014][__main__][INFO] - Starting iteration 529. [2026-03-25 22:12:34,018][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:12:34,019][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:12:38,977][__main__][INFO] - Number of regex retries in iteration 529: 0 [2026-03-25 22:12:38,978][__main__][INFO] - agents played in iteration 529 are Alice, Bob [2026-03-25 22:12:39,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:12:39,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:12:39,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:12:39,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:12:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:12:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:12:41,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:12:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:12:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:12:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:12:44,215][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:12:44,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:12:45,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:12:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:12:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:12:47,499][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:12:48,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:12:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:12:49,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:12:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:12:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:12:51,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:12:52,145][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:12:52,813][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:12:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:12:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:12:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:12:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:12:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:12:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:12:57,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:12:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:12:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:12:59,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:13:00,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:13:00,747][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:13:01,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:13:02,062][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:13:02,724][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:13:03,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:13:04,041][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:13:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:13:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:13:06,013][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:13:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:13:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:13:07,984][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:13:08,641][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:13:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:13:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:13:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:13:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:13:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:13:12,927][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:13:13,584][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:13:14,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:13:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:13:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:13:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:13:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:13:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:13:18,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:13:18,841][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:13:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:13:20,155][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:13:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:13:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:13:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:13:22,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:13:23,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:13:24,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:13:24,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:13:24,777][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:13:26,101][__main__][INFO] - Iteration 530 took 52s (9.52% Gen, 87.93% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 36m 2s. Estimated total time: 14h 28m 4s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 2s. [2026-03-25 22:13:26,105][__main__][INFO] - Starting iteration 530. [2026-03-25 22:13:26,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:13:26,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:13:31,139][__main__][INFO] - Number of regex retries in iteration 530: 0 [2026-03-25 22:13:31,141][__main__][INFO] - agents played in iteration 530 are Alice, Bob [2026-03-25 22:13:31,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:13:31,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:13:31,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:13:31,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:13:32,663][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:13:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:13:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:13:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:13:35,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:13:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:13:36,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:13:37,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:13:37,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:13:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:13:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:13:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:13:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:13:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:13:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:13:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:13:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:13:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:13:44,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:13:45,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:13:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:13:46,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:13:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:13:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:13:48,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:13:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:13:49,700][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:13:50,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:13:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:13:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:13:52,329][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:13:52,986][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:13:53,643][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:13:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:13:54,959][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:13:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:13:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:13:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:13:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:13:58,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:13:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:13:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:14:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:14:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:14:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:14:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:14:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:14:03,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:14:04,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:14:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:14:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:14:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:14:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:14:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:14:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:14:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:14:09,728][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:14:10,386][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:14:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:14:11,700][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:14:12,358][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:14:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:14:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:14:14,329][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:14:14,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:14:15,710][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:14:16,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:14:16,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:14:16,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:14:18,154][__main__][INFO] - Iteration 531 took 52s (9.66% Gen, 87.79% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 34m 32s. Estimated total time: 14h 27m 26s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-25 22:14:18,157][__main__][INFO] - Starting iteration 531. [2026-03-25 22:14:18,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:14:18,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:14:23,194][__main__][INFO] - Number of regex retries in iteration 531: 0 [2026-03-25 22:14:23,195][__main__][INFO] - agents played in iteration 531 are Alice, Bob [2026-03-25 22:14:24,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:14:24,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:14:24,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:14:24,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:14:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:14:25,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:14:26,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:14:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:14:27,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:14:28,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:14:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:14:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:14:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:14:30,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:14:31,484][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:14:32,140][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:14:32,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:14:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:14:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:14:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:14:35,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:14:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:14:36,740][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:14:37,397][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:14:38,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:14:38,711][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:14:39,368][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:14:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:14:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:14:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:14:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:14:42,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:14:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:14:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:14:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:14:45,278][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:14:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:14:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:14:47,248][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:14:47,905][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:14:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:14:49,218][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:14:49,875][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:14:50,532][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:14:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:14:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:14:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:14:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:14:53,816][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:14:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:14:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:14:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:14:56,698][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:14:57,355][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:14:58,012][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:14:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:14:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:14:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:15:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:15:01,298][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:15:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:15:02,612][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:15:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:15:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:07,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:15:08,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:15:09,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:15:09,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:15:09,206][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:15:10,542][__main__][INFO] - Iteration 532 took 52s (9.61% Gen, 87.84% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 39m 17s. Estimated total time: 14h 33m 3s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 31s. [2026-03-25 22:15:10,546][__main__][INFO] - Starting iteration 532. [2026-03-25 22:15:10,558][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:15:10,559][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:15:15,361][__main__][INFO] - Number of regex retries in iteration 532: 0 [2026-03-25 22:15:15,363][__main__][INFO] - agents played in iteration 532 are Alice, Bob [2026-03-25 22:15:15,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:15:16,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:15:16,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:15:16,037][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:15:16,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:15:17,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:15:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:15:18,698][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:15:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:15:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:15:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:15:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:15:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:15:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:15:23,295][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:15:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:15:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:15:25,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:15:25,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:15:26,579][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:15:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:15:27,893][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:15:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:15:29,207][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:15:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:15:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:15:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:15:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:15:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:15:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:15:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:15:34,462][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:15:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:15:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:15:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:15:37,091][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:15:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:15:38,405][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:15:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:15:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:15:40,376][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:15:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:15:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:15:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:15:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:15:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:15:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:15:44,976][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:15:45,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:15:46,291][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:15:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:15:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:15:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:15:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:15:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:15:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:15:51,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:15:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:15:52,473][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:15:53,131][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:15:53,788][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:15:54,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:15:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:15:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:57,733][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:59,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:15:59,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:16:00,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:00,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:00,879][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:16:02,217][__main__][INFO] - Iteration 533 took 51s (9.30% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 26m 22s. Estimated total time: 14h 21m 0s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 30s. [2026-03-25 22:16:02,220][__main__][INFO] - Starting iteration 533. [2026-03-25 22:16:02,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:16:02,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:16:07,114][__main__][INFO] - Number of regex retries in iteration 533: 0 [2026-03-25 22:16:07,116][__main__][INFO] - agents played in iteration 533 are Alice, Bob [2026-03-25 22:16:07,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:07,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:07,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:16:07,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:16:08,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:16:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:16:09,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:16:10,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:16:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:16:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:16:12,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:16:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:16:13,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:16:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:16:15,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:16:15,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:16:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:16:17,194][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:16:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:16:18,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:16:19,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:16:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:16:20,477][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:16:21,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:16:21,791][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:16:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:16:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:16:23,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:16:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:16:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:16:25,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:16:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:16:27,043][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:16:27,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:16:28,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:16:29,012][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:16:29,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:16:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:16:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:16:31,639][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:16:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:16:32,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:16:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:16:34,266][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:16:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:16:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:16:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:16:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:16:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:16:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:16:38,863][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:16:39,519][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:16:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:16:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:16:41,769][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:16:42,426][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:16:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:16:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:16:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:16:45,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:16:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:16:46,368][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:16:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:16:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:16:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:16:48,995][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:16:49,651][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:16:50,308][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:16:50,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:16:51,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:16:52,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:52,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:52,885][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:16:54,116][__main__][INFO] - Iteration 534 took 51s (9.43% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 29m 24s. Estimated total time: 14h 24m 54s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 27s. [2026-03-25 22:16:54,119][__main__][INFO] - Starting iteration 534. [2026-03-25 22:16:54,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:16:54,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:16:59,001][__main__][INFO] - Number of regex retries in iteration 534: 0 [2026-03-25 22:16:59,002][__main__][INFO] - agents played in iteration 534 are Alice, Bob [2026-03-25 22:16:59,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:59,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:59,955][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:16:59,955][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:17:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:17:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:17:01,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:17:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:17:03,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:17:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:17:04,579][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:17:05,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:17:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:17:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:17:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:17:07,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:17:08,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:17:09,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:17:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:17:10,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:17:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:17:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:17:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:17:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:17:13,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:17:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:17:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:17:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:17:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:17:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:17:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:17:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:17:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:17:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:17:20,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:17:20,996][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:17:21,652][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:17:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:17:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:17:23,622][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:17:24,279][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:17:24,935][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:17:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:17:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:17:26,909][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:17:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:17:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:17:28,879][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:17:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:17:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:17:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:17:31,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:17:32,416][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:17:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:17:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:17:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:17:35,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:17:35,700][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:17:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:17:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:17:37,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:17:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:17:38,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:17:39,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:17:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:17:40,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:17:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:17:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:17:42,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:17:43,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:17:44,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:17:44,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:17:44,895][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:17:46,118][__main__][INFO] - Iteration 535 took 51s (9.38% Gen, 88.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 30m 14s. Estimated total time: 14h 26m 36s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 18s. [2026-03-25 22:17:46,120][__main__][INFO] - Starting iteration 535. [2026-03-25 22:17:46,124][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:17:46,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:17:51,314][__main__][INFO] - Number of regex retries in iteration 535: 0 [2026-03-25 22:17:51,315][__main__][INFO] - agents played in iteration 535 are Alice, Bob [2026-03-25 22:17:52,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:17:52,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:17:52,314][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:17:52,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:17:53,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:17:53,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:17:54,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:17:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:17:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:17:56,278][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:17:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:17:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:17:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:17:58,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:17:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:18:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:18:00,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:18:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:18:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:18:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:18:03,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:18:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:18:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:18:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:18:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:18:06,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:18:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:18:08,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:18:08,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:18:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:18:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:18:10,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:18:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:18:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:18:12,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:18:13,354][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:18:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:18:14,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:18:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:18:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:18:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:18:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:18:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:18:18,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:18:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:18:19,922][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:18:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:18:21,236][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:18:21,892][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:18:22,550][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:18:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:18:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:18:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:18:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:18:26,103][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:18:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:18:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:18:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:18:28,732][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:18:29,389][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:18:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:18:30,703][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:18:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:18:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:18:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:18:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:18:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:18:34,647][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:18:35,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:18:36,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 22:18:37,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:18:37,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:18:37,150][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:18:38,375][__main__][INFO] - Iteration 536 took 52s (9.93% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 33m 38s. Estimated total time: 14h 30m 52s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 26s. [2026-03-25 22:18:38,378][__main__][INFO] - Starting iteration 536. [2026-03-25 22:18:38,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:18:38,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:18:44,273][__main__][INFO] - Number of regex retries in iteration 536: 0 [2026-03-25 22:18:44,275][__main__][INFO] - agents played in iteration 536 are Alice, Bob [2026-03-25 22:18:45,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:18:45,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:18:45,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:18:45,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:18:45,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:18:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:18:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:18:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:18:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:18:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:18:49,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:18:50,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:18:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:18:51,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:18:52,424][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:18:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:18:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:18:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:18:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:18:55,711][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:18:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:18:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:18:57,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:18:58,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:18:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:18:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:19:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:19:00,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:19:01,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:19:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:19:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:19:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:19:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:19:04,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:19:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:19:06,221][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:19:06,878][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:19:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:19:08,192][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:19:08,848][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:19:09,505][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:19:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:19:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:19:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:19:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:19:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:19:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:19:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:19:14,760][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:19:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:19:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:19:16,730][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:19:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:19:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:19:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:19:19,690][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:19:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:19:21,005][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:19:21,663][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:19:22,320][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:19:22,978][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:19:23,634][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:19:24,292][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:19:24,949][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:19:25,606][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:19:26,266][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:19:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:19:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:19:28,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:19:28,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:19:30,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:19:30,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:19:30,319][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:19:31,544][__main__][INFO] - Iteration 537 took 53s (11.08% Gen, 86.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 47m 56s. Estimated total time: 14h 46m 3s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 36s, 500 more iterations: 7h 23m 1s. [2026-03-25 22:19:31,547][__main__][INFO] - Starting iteration 537. [2026-03-25 22:19:31,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:19:31,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:19:36,592][__main__][INFO] - Number of regex retries in iteration 537: 0 [2026-03-25 22:19:36,593][__main__][INFO] - agents played in iteration 537 are Alice, Bob [2026-03-25 22:19:37,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:19:37,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:19:37,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:19:37,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:19:38,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:19:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:19:39,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:19:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:19:40,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:19:41,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:19:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:19:42,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:19:43,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:19:43,949][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:19:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:19:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:19:45,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:19:46,576][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:19:47,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:19:47,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:19:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:19:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:19:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:19:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:19:51,175][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:19:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:19:52,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:19:53,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:19:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:19:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:19:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:19:55,778][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:19:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:19:57,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:19:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:19:58,407][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:19:59,066][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:19:59,722][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:20:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:20:01,036][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:20:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:20:02,350][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:20:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:20:03,664][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:20:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:20:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:20:05,656][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:20:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:20:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:20:07,627][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:20:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:20:08,940][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:20:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:20:10,619][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:20:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:20:11,932][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:20:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:20:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:20:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:20:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:20:15,216][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:20:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:20:16,530][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:20:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:20:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:20:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:20:19,157][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:20:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:20:20,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:20:21,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:20:22,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:20:22,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:20:22,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:20:23,760][__main__][INFO] - Iteration 538 took 52s (9.66% Gen, 87.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 31m 11s. Estimated total time: 14h 30m 11s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 5s. [2026-03-25 22:20:23,765][__main__][INFO] - Starting iteration 538. [2026-03-25 22:20:23,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:20:23,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:20:29,168][__main__][INFO] - Number of regex retries in iteration 538: 0 [2026-03-25 22:20:29,169][__main__][INFO] - agents played in iteration 538 are Alice, Bob [2026-03-25 22:20:30,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:20:30,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:20:30,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:20:30,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:20:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:20:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:20:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:20:32,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:20:33,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:20:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:20:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:20:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:20:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:20:36,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:20:37,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:20:37,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:20:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:20:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:20:39,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:20:40,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:20:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:20:41,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:20:42,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:20:43,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:20:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:20:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:20:45,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:20:45,837][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:20:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:20:47,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:20:47,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:20:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:20:49,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:20:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:20:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:20:51,092][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:20:51,749][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:20:52,405][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:20:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:20:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:20:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:20:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:20:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:20:56,346][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:20:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:20:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:20:58,316][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:20:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:20:59,629][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:21:00,286][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:21:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:21:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:21:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:21:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:21:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:21:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:21:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:21:05,836][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:21:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:21:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:21:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:21:08,466][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:21:09,123][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:21:09,780][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:21:10,437][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:21:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:21:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:21:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:21:13,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:21:13,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:21:14,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:21:14,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:21:14,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:21:16,208][__main__][INFO] - Iteration 539 took 52s (10.29% Gen, 87.30% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 34m 7s. Estimated total time: 14h 33m 59s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 59s. [2026-03-25 22:21:16,211][__main__][INFO] - Starting iteration 539. [2026-03-25 22:21:16,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:21:16,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:21:21,102][__main__][INFO] - Number of regex retries in iteration 539: 0 [2026-03-25 22:21:21,104][__main__][INFO] - agents played in iteration 539 are Alice, Bob [2026-03-25 22:21:21,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:21:21,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:21:21,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:21:21,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:21:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:21:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:21:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:21:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:21:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:21:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:21:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:21:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:21:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:21:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:21:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:21:29,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:21:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:21:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:21:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:21:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:21:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:21:33,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:21:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:21:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:21:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:21:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:21:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:21:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:21:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:21:39,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:21:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:21:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:21:41,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:21:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:21:42,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:21:42,999][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:21:43,656][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:21:44,313][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:21:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:21:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:21:46,284][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:21:46,941][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:21:47,599][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:21:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:21:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:21:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:21:50,226][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:21:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:21:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:21:52,198][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:21:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:21:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:21:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:21:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:21:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:21:56,387][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:21:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:21:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:21:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:21:59,014][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:21:59,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:22:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:22:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:22:01,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:22:02,299][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:22:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:22:03,613][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:22:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:22:04,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:22:05,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:22:06,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:22:06,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:22:06,892][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:22:08,133][__main__][INFO] - Iteration 540 took 51s (9.42% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 24m 36s. Estimated total time: 14h 25m 20s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 40s. [2026-03-25 22:22:08,136][__main__][INFO] - Starting iteration 540. [2026-03-25 22:22:08,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:22:08,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:22:18,118][__main__][INFO] - Number of regex retries in iteration 540: 0 [2026-03-25 22:22:18,119][__main__][INFO] - agents played in iteration 540 are Alice, Bob [2026-03-25 22:22:18,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:22:18,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:22:18,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:22:18,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:22:19,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:22:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:22:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:22:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:22:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:22:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:22:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:22:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:22:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:22:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:22:26,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:22:26,723][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:22:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:22:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:22:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:22:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:22:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:22:30,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:22:31,320][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:22:31,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:22:32,635][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:22:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:22:33,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:22:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:22:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:22:35,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:22:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:22:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:22:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:22:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:22:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:22:39,862][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:22:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:22:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:22:41,833][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:22:42,491][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:22:43,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:22:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:22:44,461][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:22:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:22:45,775][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:22:46,432][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:22:47,089][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:22:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:22:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:22:49,060][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:22:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:22:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:22:51,306][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:22:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:22:52,621][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:22:53,277][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:22:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:22:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:22:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:22:55,907][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:22:56,565][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:22:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:22:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:22:58,536][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:22:59,194][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:22:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:23:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:23:01,166][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:23:01,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:23:02,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:23:03,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:23:03,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:23:03,720][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:23:04,957][__main__][INFO] - Iteration 541 took 56s (17.56% Gen, 80.26% Train). Generation: 9s, Training: 45s. Estimated remaining time: 7h 45m 18s. Estimated total time: 15h 46m 59s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 29s. [2026-03-25 22:23:04,960][__main__][INFO] - Starting iteration 541. [2026-03-25 22:23:04,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:23:04,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:23:10,066][__main__][INFO] - Number of regex retries in iteration 541: 0 [2026-03-25 22:23:10,067][__main__][INFO] - agents played in iteration 541 are Alice, Bob [2026-03-25 22:23:10,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:23:10,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:23:10,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:23:10,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:23:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:23:12,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:23:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:23:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:23:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:23:14,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:23:15,474][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:23:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:23:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:23:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:23:18,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:23:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:23:19,415][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:23:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:23:20,728][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:23:21,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:23:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:23:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:23:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:23:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:23:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:23:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:23:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:23:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:23:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:23:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:23:28,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:23:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:23:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:23:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:23:31,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:23:31,900][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:23:32,557][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:23:33,214][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:23:33,870][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:23:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:23:35,184][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:23:35,841][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:23:36,498][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:23:37,155][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:23:37,812][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:23:38,469][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:23:39,126][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:23:39,782][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:23:40,439][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:23:41,096][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:23:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:23:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:23:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:23:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:23:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:23:45,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:23:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:23:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:23:47,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:23:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:23:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:23:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:23:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:23:50,543][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:23:51,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:23:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:23:52,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:23:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:23:53,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:23:54,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:23:55,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:23:55,798][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:23:55,800][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:23:57,011][__main__][INFO] - Iteration 542 took 52s (9.77% Gen, 87.90% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 24m 36s. Estimated total time: 14h 27m 9s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 42s, 500 more iterations: 7h 13m 34s. [2026-03-25 22:23:57,014][__main__][INFO] - Starting iteration 542. [2026-03-25 22:23:57,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:23:57,021][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:24:01,844][__main__][INFO] - Number of regex retries in iteration 542: 0 [2026-03-25 22:24:01,846][__main__][INFO] - agents played in iteration 542 are Alice, Bob [2026-03-25 22:24:02,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:02,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:02,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:24:02,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:24:03,494][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:24:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:24:04,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:24:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:24:06,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:24:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:24:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:24:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:24:08,702][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:24:09,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:24:10,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:24:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:24:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:24:11,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:24:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:24:13,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:24:13,958][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:24:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:24:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:24:15,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:24:16,586][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:24:17,243][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:24:17,900][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:24:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:24:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:24:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:24:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:24:21,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:24:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:24:22,498][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:24:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:24:23,811][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:24:24,468][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:24:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:24:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:24:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:24:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:24:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:24:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:24:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:24:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:24:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:24:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:24:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:24:32,353][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:24:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:24:33,667][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:24:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:24:35,243][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:24:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:24:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:24:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:24:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:24:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:24:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:24:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:24:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:24:41,158][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:24:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:24:42,473][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:24:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:24:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:24:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:24:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:24:45,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:24:46,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:24:47,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:24:47,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:24:47,795][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:24:49,132][__main__][INFO] - Iteration 543 took 52s (9.26% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 25m 8s. Estimated total time: 14h 28m 33s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 16s. [2026-03-25 22:24:49,135][__main__][INFO] - Starting iteration 543. [2026-03-25 22:24:49,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:24:49,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:24:55,477][__main__][INFO] - Number of regex retries in iteration 543: 0 [2026-03-25 22:24:55,478][__main__][INFO] - agents played in iteration 543 are Alice, Bob [2026-03-25 22:24:56,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:56,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:56,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:24:56,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:24:56,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:24:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:24:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:24:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:24:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:25:00,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:25:00,846][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:25:01,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:25:02,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:25:02,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:25:03,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:25:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:25:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:25:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:25:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:25:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:25:07,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:25:08,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:25:08,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:25:09,383][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:25:10,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:25:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:25:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:25:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:25:12,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:25:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:25:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:25:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:25:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:25:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:25:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:25:17,267][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:25:17,923][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:25:18,581][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:25:19,238][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:25:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:25:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:25:21,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:25:21,865][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:25:22,521][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:25:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:25:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:25:24,491][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:25:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:25:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:25:26,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:25:27,120][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:25:27,777][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:25:28,731][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:25:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:25:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:25:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:25:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:25:32,015][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:25:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:25:33,329][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:25:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:25:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:25:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:25:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:25:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:25:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:25:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:25:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:25:39,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:25:39,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:25:41,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:25:41,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:25:41,155][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:25:42,571][__main__][INFO] - Iteration 544 took 53s (11.86% Gen, 85.48% Train). Generation: 6s, Training: 45s. Estimated remaining time: 6h 46m 15s. Estimated total time: 14h 50m 34s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 3s, 500 more iterations: 7h 25m 17s. [2026-03-25 22:25:42,574][__main__][INFO] - Starting iteration 544. [2026-03-25 22:25:42,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:25:42,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:25:47,463][__main__][INFO] - Number of regex retries in iteration 544: 0 [2026-03-25 22:25:47,464][__main__][INFO] - agents played in iteration 544 are Alice, Bob [2026-03-25 22:25:48,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:25:48,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:25:48,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:25:48,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:25:49,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:25:49,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:25:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:25:51,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:25:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:25:52,353][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:25:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:25:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:25:54,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:25:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:25:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:25:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:25:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:25:57,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:25:58,265][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:25:58,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:25:59,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:26:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:26:00,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:26:01,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:26:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:26:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:26:03,522][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:26:04,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:26:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:26:05,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:26:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:26:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:26:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:26:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:26:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:26:09,435][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:26:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:26:10,749][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:26:11,405][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:26:12,062][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:26:12,719][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:26:13,376][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:26:14,033][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:26:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:26:15,347][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:26:16,004][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:26:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:26:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:26:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:26:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:26:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:26:19,946][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:26:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:26:21,532][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:26:22,188][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:26:22,845][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:26:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:26:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:26:24,817][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:26:25,474][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:26:26,133][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:26:26,789][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:26:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:26:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:26:28,760][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:26:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:26:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:26:30,731][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:26:31,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:26:32,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:26:33,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:26:33,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:26:33,323][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:26:34,805][__main__][INFO] - Iteration 545 took 52s (9.35% Gen, 87.81% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 25m 17s. Estimated total time: 14h 30m 28s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 14s. [2026-03-25 22:26:34,807][__main__][INFO] - Starting iteration 545. [2026-03-25 22:26:34,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:26:34,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:26:39,493][__main__][INFO] - Number of regex retries in iteration 545: 0 [2026-03-25 22:26:39,494][__main__][INFO] - agents played in iteration 545 are Alice, Bob [2026-03-25 22:26:40,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:26:40,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:26:40,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:26:40,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:26:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:26:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:26:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:26:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:26:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:26:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:26:44,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:26:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:26:46,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:26:46,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:26:47,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:26:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:26:48,686][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:26:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:26:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:26:50,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:26:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:26:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:26:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:26:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:26:53,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:26:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:26:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:26:55,915][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:26:56,572][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:26:57,229][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:26:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:26:58,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:26:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:26:59,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:27:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:27:01,172][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:27:01,829][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:27:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:27:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:27:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:27:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:27:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:27:05,770][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:27:06,426][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:27:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:27:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:27:08,397][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:27:09,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:27:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:27:10,368][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:27:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:27:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:27:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:27:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:27:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:27:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:27:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:27:15,922][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:27:16,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:27:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:27:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:27:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:27:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:27:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:27:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:27:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:27:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:27:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:27:23,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:27:23,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:27:25,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:27:25,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:27:25,061][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:27:26,289][__main__][INFO] - Iteration 546 took 51s (9.10% Gen, 88.52% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 11m 57s. Estimated total time: 14h 17m 59s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 47s, 500 more iterations: 7h 8m 59s. [2026-03-25 22:27:26,292][__main__][INFO] - Starting iteration 546. [2026-03-25 22:27:26,296][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:27:26,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:27:31,122][__main__][INFO] - Number of regex retries in iteration 546: 0 [2026-03-25 22:27:31,123][__main__][INFO] - agents played in iteration 546 are Alice, Bob [2026-03-25 22:27:31,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:31,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:31,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:27:31,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:27:32,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:27:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:27:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:27:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:27:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:27:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:27:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:27:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:27:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:27:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:27:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:27:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:27:40,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:27:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:27:41,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:27:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:27:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:27:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:27:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:27:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:27:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:27:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:27:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:27:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:27:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:27:49,036][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:27:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:27:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:27:51,007][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:27:51,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:27:52,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:27:52,977][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:27:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:27:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:27:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:27:55,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:27:56,263][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:27:56,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:27:57,577][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:27:58,234][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:27:58,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:27:59,547][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:28:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:28:00,861][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:28:01,517][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:28:02,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:28:02,830][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:28:03,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:28:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:28:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:28:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:28:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:28:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:28:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:28:08,345][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:28:09,002][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:28:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:28:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:28:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:28:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:28:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:28:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:28:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:28:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:28:14,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:28:15,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:28:16,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:28:16,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:28:16,894][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:28:18,354][__main__][INFO] - Iteration 547 took 52s (9.27% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 20m 45s. Estimated total time: 14h 27m 39s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 49s. [2026-03-25 22:28:18,357][__main__][INFO] - Starting iteration 547. [2026-03-25 22:28:18,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:28:18,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:28:23,130][__main__][INFO] - Number of regex retries in iteration 547: 0 [2026-03-25 22:28:23,131][__main__][INFO] - agents played in iteration 547 are Alice, Bob [2026-03-25 22:28:23,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:28:23,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:28:23,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:28:23,732][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:28:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:28:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:28:25,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:28:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:28:26,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:28:27,636][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:28:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:28:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:28:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:28:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:28:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:28:31,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:28:32,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:28:32,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:28:33,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:28:34,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:28:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:28:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:28:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:28:36,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:28:37,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:28:38,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:28:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:28:39,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:28:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:28:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:28:41,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:28:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:28:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:28:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:28:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:28:44,716][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:28:45,373][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:28:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:28:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:28:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:28:48,000][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:28:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:28:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:28:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:28:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:28:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:28:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:28:52,598][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:28:53,255][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:28:53,912][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:28:54,570][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:28:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:28:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:28:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:28:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:28:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:28:58,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:28:59,426][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:29:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:29:00,739][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:29:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:29:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:29:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:29:03,368][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:29:04,025][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:29:04,682][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:29:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:29:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:29:06,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:29:07,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:29:08,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:29:08,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:29:08,606][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:29:09,960][__main__][INFO] - Iteration 548 took 51s (9.25% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 12m 15s. Estimated total time: 14h 20m 1s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 0s, 500 more iterations: 7h 10m 0s. [2026-03-25 22:29:09,962][__main__][INFO] - Starting iteration 548. [2026-03-25 22:29:09,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:29:09,969][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:29:15,609][__main__][INFO] - Number of regex retries in iteration 548: 0 [2026-03-25 22:29:15,610][__main__][INFO] - agents played in iteration 548 are Alice, Bob [2026-03-25 22:29:16,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:29:16,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:29:16,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:29:16,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:29:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:29:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:29:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:29:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:29:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:29:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:29:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:29:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:29:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:29:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:29:23,477][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:29:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:29:24,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:29:25,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:29:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:29:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:29:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:29:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:29:28,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:29:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:29:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:29:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:29:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:29:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:29:32,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:29:33,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:29:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:29:34,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:29:35,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:29:35,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:29:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:29:37,309][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:29:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:29:38,624][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:29:39,281][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:29:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:29:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:29:41,254][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:29:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:29:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:29:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:29:43,882][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:29:44,540][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:29:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:29:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:29:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:29:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:29:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:29:48,816][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:29:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:29:50,130][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:29:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:29:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:29:52,101][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:29:52,758][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:29:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:29:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:29:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:29:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:29:56,044][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:29:56,703][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:29:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:29:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:29:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:29:59,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:30:00,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:30:01,334][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:30:01,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:30:01,339][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:30:02,627][__main__][INFO] - Iteration 549 took 52s (10.71% Gen, 86.84% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 29m 3s. Estimated total time: 14h 37m 42s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 51s. [2026-03-25 22:30:02,630][__main__][INFO] - Starting iteration 549. [2026-03-25 22:30:02,634][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:30:02,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:30:07,947][__main__][INFO] - Number of regex retries in iteration 549: 0 [2026-03-25 22:30:07,948][__main__][INFO] - agents played in iteration 549 are Alice, Bob [2026-03-25 22:30:08,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:30:08,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:30:08,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:30:08,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:30:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:30:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:30:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:30:11,195][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:30:11,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:30:12,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:30:13,164][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:30:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:30:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:30:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:30:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:30:16,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:30:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:30:17,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:30:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:30:19,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:30:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:30:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:30:21,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:30:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:30:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:30:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:30:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:30:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:30:24,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:30:25,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:30:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:30:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:30:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:30:28,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:30:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:30:29,585][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:30:30,242][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:30:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:30:31,555][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:30:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:30:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:30:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:30:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:30:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:30:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:30:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:30:36,808][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:30:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:30:38,123][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:30:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:30:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:30:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:30:41,087][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:30:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:30:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:30:43,058][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:30:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:30:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:30:45,029][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:30:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:30:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:30:47,001][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:30:47,658][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:30:48,315][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:30:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:30:49,631][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:30:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:30:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:30:51,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:30:52,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:30:53,532][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:30:53,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:30:53,537][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:30:55,098][__main__][INFO] - Iteration 550 took 52s (10.13% Gen, 86.89% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 24m 54s. Estimated total time: 14h 34m 25s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 12s. [2026-03-25 22:30:55,101][__main__][INFO] - Starting iteration 550. [2026-03-25 22:30:55,105][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:30:55,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:30:59,852][__main__][INFO] - Number of regex retries in iteration 550: 0 [2026-03-25 22:30:59,854][__main__][INFO] - agents played in iteration 550 are Alice, Bob [2026-03-25 22:31:00,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:31:00,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:31:00,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:31:00,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:31:01,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:31:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:31:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:31:03,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:31:03,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:31:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:31:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:31:05,709][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:31:06,366][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:31:07,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:31:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:31:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:31:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:31:09,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:31:10,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:31:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:31:11,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:31:12,278][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:31:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:31:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:31:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:31:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:31:15,561][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:31:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:31:16,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:31:17,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:31:18,187][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:31:18,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:31:19,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:31:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:31:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:31:21,469][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:31:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:31:22,782][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:31:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:31:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:31:24,752][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:31:25,408][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:31:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:31:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:31:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:31:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:31:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:31:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:31:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:31:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:31:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:31:31,976][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:31:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:31:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:31:34,294][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:31:34,951][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:31:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:31:36,264][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:31:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:31:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:31:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:31:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:31:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:31:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:31:40,861][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:31:41,518][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:31:42,174][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:31:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:31:43,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:31:44,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:31:51,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:31:51,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:31:51,621][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:31:55,528][__main__][INFO] - Iteration 551 took 1m 0s (7.86% Gen, 85.67% Train). Generation: 4s, Training: 51s. Estimated remaining time: 8h 36m 33s. Estimated total time: 16h 47m 5s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 42s, 500 more iterations: 8h 23m 32s. [2026-03-25 22:31:55,532][__main__][INFO] - Starting iteration 551. [2026-03-25 22:31:55,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:31:55,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:32:00,594][__main__][INFO] - Number of regex retries in iteration 551: 0 [2026-03-25 22:32:00,596][__main__][INFO] - agents played in iteration 551 are Alice, Bob [2026-03-25 22:32:01,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:01,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:01,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:32:01,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:32:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:32:02,561][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:32:03,217][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:32:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:32:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:32:05,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:32:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:32:06,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:32:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:32:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:32:08,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:32:09,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:32:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:32:10,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:32:11,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:32:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:32:12,407][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:32:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:32:13,720][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:32:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:32:15,034][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:32:15,690][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:32:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:32:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:32:17,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:32:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:32:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:32:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:32:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:32:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:32:21,604][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:32:22,259][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:32:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:32:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:32:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:32:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:32:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:32:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:32:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:32:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:32:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:32:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:32:29,488][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:32:30,145][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:32:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:32:31,459][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:32:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:32:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:32:33,758][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:32:34,417][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:32:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:32:35,729][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:32:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:32:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:32:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:32:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:32:39,014][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:32:39,671][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:32:40,328][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:32:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:32:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:32:42,297][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:32:42,953][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:32:43,611][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:32:44,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:32:45,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:32:46,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:32:46,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:32:46,158][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:32:48,510][__main__][INFO] - Iteration 552 took 52s (9.55% Gen, 86.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 31m 31s. Estimated total time: 14h 42m 56s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 28s. [2026-03-25 22:32:48,513][__main__][INFO] - Starting iteration 552. [2026-03-25 22:32:48,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:32:48,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:32:53,511][__main__][INFO] - Number of regex retries in iteration 552: 0 [2026-03-25 22:32:53,513][__main__][INFO] - agents played in iteration 552 are Alice, Bob [2026-03-25 22:32:54,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:54,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:54,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:32:54,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:32:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:32:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:32:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:32:56,862][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:32:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:32:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:32:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:32:59,488][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:33:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:33:00,802][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:33:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:33:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:33:02,780][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:33:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:33:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:33:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:33:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:33:06,079][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:33:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:33:07,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:33:08,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:33:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:33:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:33:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:33:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:33:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:33:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:33:12,662][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:33:13,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:33:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:33:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:33:15,290][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:33:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:33:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:33:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:33:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:33:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:33:19,235][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:33:19,894][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:33:20,552][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:33:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:33:21,871][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:33:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:33:23,187][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:33:23,845][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:33:24,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:33:25,162][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:33:25,819][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:33:26,810][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:33:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:33:28,124][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:33:28,784][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:33:29,441][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:33:30,098][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:33:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:33:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:33:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:33:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:33:33,444][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:33:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:33:34,771][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:33:35,431][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:33:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:33:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:33:37,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:33:38,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:33:39,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:33:39,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:33:39,563][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:33:40,979][__main__][INFO] - Iteration 553 took 52s (9.52% Gen, 87.77% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 22m 6s. Estimated total time: 14h 34m 23s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 11s. [2026-03-25 22:33:40,984][__main__][INFO] - Starting iteration 553. [2026-03-25 22:33:40,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2026-03-25 22:33:40,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:33:44,940][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1 [2026-03-25 22:33:46,672][__main__][INFO] - Number of regex retries in iteration 553: 1 [2026-03-25 22:33:46,673][__main__][INFO] - agents played in iteration 553 are Alice, Bob [2026-03-25 22:33:47,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:33:47,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:33:47,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:33:47,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:33:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:33:48,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:33:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:33:50,046][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:33:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:33:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:33:52,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:33:52,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:33:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:33:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:33:54,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 22:33:55,324][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 22:33:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:33:56,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:33:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:33:57,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:33:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:33:59,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:33:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:34:00,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:34:01,236][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:34:01,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:34:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:34:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:34:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:34:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:34:05,174][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:34:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:34:06,487][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:34:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:34:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:34:08,458][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
22:34:09,115][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:34:09,771][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:34:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:34:11,084][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:34:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:34:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:34:13,053][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:34:13,711][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:34:14,368][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:34:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:34:15,687][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:34:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:34:17,000][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:34:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:34:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:34:18,970][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:34:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:34:20,638][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:34:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:34:21,952][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 22:34:22,609][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 22:34:23,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:34:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:34:24,584][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:34:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:34:25,900][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:34:26,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:34:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:34:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:34:28,530][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:34:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:34:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:34:30,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:34:31,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:34:32,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:34:32,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:34:32,442][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:34:33,954][__main__][INFO] - Iteration 554 took 52s (10.73% Gen, 86.41% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 29m 37s. Estimated total time: 14h 42m 47s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 16s, 500 more iterations: 7h 21m 23s. [2026-03-25 22:34:33,957][__main__][INFO] - Starting iteration 554. [2026-03-25 22:34:33,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:34:33,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:34:42,267][__main__][INFO] - Number of regex retries in iteration 554: 0 [2026-03-25 22:34:42,270][__main__][INFO] - agents played in iteration 554 are Alice, Bob [2026-03-25 22:34:42,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:34:42,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:34:42,951][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:34:42,951][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:34:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:34:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:34:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:34:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:34:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:34:46,946][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:34:47,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:34:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:34:48,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:34:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:34:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:34:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:34:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:34:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:34:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:34:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:34:54,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:34:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:34:55,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:34:56,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:34:56,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:34:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:34:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:34:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:34:59,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:35:00,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:35:00,745][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:35:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:35:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:35:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:35:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:35:04,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:35:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:35:05,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:35:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:35:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:35:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:35:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:35:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:35:09,284][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:35:09,941][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:35:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:35:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:35:11,911][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:35:12,567][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:35:13,223][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:35:13,880][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:35:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:35:15,518][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:35:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:35:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:35:17,489][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:35:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:35:18,802][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:35:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:35:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:35:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:35:21,427][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:35:22,084][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:35:22,741][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:35:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:35:24,054][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:35:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:35:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:35:26,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:35:26,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:35:28,034][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:35:28,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:35:28,038][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:35:29,244][__main__][INFO] - Iteration 555 took 55s (15.03% Gen, 82.79% Train). Generation: 8s, Training: 45s. Estimated remaining time: 7h 7m 18s. Estimated total time: 15h 21m 23s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 41s. [2026-03-25 22:35:29,246][__main__][INFO] - Starting iteration 555. [2026-03-25 22:35:29,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:35:29,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:35:34,166][__main__][INFO] - Number of regex retries in iteration 555: 0 [2026-03-25 22:35:34,167][__main__][INFO] - agents played in iteration 555 are Alice, Bob [2026-03-25 22:35:34,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:34,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:34,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:35:34,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:35:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:35:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:35:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:35:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:35:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:35:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:35:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:35:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:35:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:35:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:35:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:35:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:35:43,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:35:44,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:35:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:35:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:35:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:35:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:35:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:35:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:35:48,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:35:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:35:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:35:50,579][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:35:51,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:35:51,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:35:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:35:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:35:53,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:35:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:35:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:35:55,835][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:35:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:35:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:35:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:35:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:35:59,122][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:35:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:36:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:36:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:36:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:36:02,408][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:36:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:36:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:36:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:36:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:36:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:36:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:36:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:36:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:36:08,645][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:36:09,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:36:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:36:10,616][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:36:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:36:11,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:36:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:36:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:36:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:36:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:36:15,217][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:36:15,874][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:36:16,532][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:36:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:36:17,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:36:18,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:36:19,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:36:19,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:36:19,652][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:36:21,119][__main__][INFO] - Iteration 556 took 51s (9.48% Gen, 87.69% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 9m 33s. Estimated total time: 14h 24m 30s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 15s. [2026-03-25 22:36:21,121][__main__][INFO] - Starting iteration 556. [2026-03-25 22:36:21,125][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:36:21,126][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:36:26,302][__main__][INFO] - Number of regex retries in iteration 556: 0 [2026-03-25 22:36:26,303][__main__][INFO] - agents played in iteration 556 are Alice, Bob [2026-03-25 22:36:26,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:36:26,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:36:26,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:36:26,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:36:27,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:36:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:36:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:36:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:36:30,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:36:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:36:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:36:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:36:33,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:36:33,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:36:34,512][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:36:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:36:35,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:36:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:36:37,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:36:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:36:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:36:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:36:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:36:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:36:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:36:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:36:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:36:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:36:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:36:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:36:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:36:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:36:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:36:47,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:36:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:36:48,367][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:36:49,029][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:36:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:36:50,344][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:36:51,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:36:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:36:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:36:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:36:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:36:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:36:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:36:55,618][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:36:56,278][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:36:56,935][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:36:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:36:58,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:36:58,906][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:37:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:37:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:37:01,326][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:37:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:37:02,640][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:37:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:37:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:37:04,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:37:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:37:05,928][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:37:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:37:07,242][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:37:07,900][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:37:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:37:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:37:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:37:10,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:37:11,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:37:12,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:37:12,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:37:12,480][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:37:13,833][__main__][INFO] - Iteration 557 took 52s (9.82% Gen, 87.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 22m 40s. Estimated total time: 14h 38m 29s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 14s. [2026-03-25 22:37:13,836][__main__][INFO] - Starting iteration 557. [2026-03-25 22:37:13,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:37:13,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:37:18,912][__main__][INFO] - Number of regex retries in iteration 557: 0 [2026-03-25 22:37:18,914][__main__][INFO] - agents played in iteration 557 are Alice, Bob [2026-03-25 22:37:19,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:37:19,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:37:19,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:37:19,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:37:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:37:21,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:37:21,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:37:22,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:37:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:37:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:37:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:37:25,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:37:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:37:26,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:37:27,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:37:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:37:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:37:29,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:37:29,675][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:37:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:37:30,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:37:31,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:37:32,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:37:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:37:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:37:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:37:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:37:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:37:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:37:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:37:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:37:38,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:37:38,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:37:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:37:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:37:40,851][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:37:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:37:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:37:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:37:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:37:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:37:44,794][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:37:45,451][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:37:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:37:46,765][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:37:47,422][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:37:48,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:37:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:37:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:37:50,050][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:37:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:37:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:37:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:37:53,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:37:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:37:54,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:37:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:37:55,671][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:37:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:37:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:37:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:37:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:37:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:37:59,616][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:38:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:38:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:38:01,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:38:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:38:02,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:38:03,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:38:04,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:38:04,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:38:04,761][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:38:06,144][__main__][INFO] - Iteration 558 took 52s (9.70% Gen, 87.65% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 15m 3s. Estimated total time: 14h 31m 45s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 52s. [2026-03-25 22:38:06,146][__main__][INFO] - Starting iteration 558. [2026-03-25 22:38:06,153][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:38:06,154][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:38:11,346][__main__][INFO] - Number of regex retries in iteration 558: 0 [2026-03-25 22:38:11,348][__main__][INFO] - agents played in iteration 558 are Alice, Bob [2026-03-25 22:38:11,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:38:12,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:38:12,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:38:12,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:38:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:38:13,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:38:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:38:14,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:38:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:38:16,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:38:16,675][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:38:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:38:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:38:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:38:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:38:19,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:38:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:38:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:38:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:38:22,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:38:23,246][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:38:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:38:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:38:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:38:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:38:26,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:38:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:38:27,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:38:28,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:38:29,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:38:29,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:38:30,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:38:31,129][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:38:31,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:38:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:38:33,100][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:38:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:38:34,414][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:38:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:38:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:38:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:38:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:38:37,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:38:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:38:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:38:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:38:40,326][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:38:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:38:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:38:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:38:42,955][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:38:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:38:44,603][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:38:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:38:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:38:46,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:38:47,235][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:38:47,892][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:38:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:38:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:38:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:38:50,520][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:38:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:38:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:38:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:38:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:38:53,804][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:38:54,462][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:38:55,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:38:55,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:38:57,365][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:38:57,368][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:38:57,373][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:38:58,781][__main__][INFO] - Iteration 559 took 52s (9.87% Gen, 87.45% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 19m 36s. Estimated total time: 14h 37m 11s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 35s. [2026-03-25 22:38:58,784][__main__][INFO] - Starting iteration 559. [2026-03-25 22:38:58,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:38:58,788][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:39:03,540][__main__][INFO] - Number of regex retries in iteration 559: 0 [2026-03-25 22:39:03,542][__main__][INFO] - agents played in iteration 559 are Alice, Bob [2026-03-25 22:39:04,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:04,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:04,228][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:39:04,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:39:04,910][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:39:05,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:39:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:39:06,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:39:07,494][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:39:08,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:39:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:39:09,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:39:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:39:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:39:11,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:39:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:39:12,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:39:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:39:14,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:39:14,721][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:39:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:39:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:39:16,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:39:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:39:18,006][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:39:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:39:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:39:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:39:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:39:21,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:39:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:39:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:39:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:39:23,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:39:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:39:25,240][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:39:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:39:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:39:27,233][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:39:27,898][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:39:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:39:29,221][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:39:29,884][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:39:30,549][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:39:31,214][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:39:31,876][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:39:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:39:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:39:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:39:34,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:39:35,190][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:39:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:39:36,846][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:39:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:39:38,161][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:39:38,818][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:39:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:39:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:39:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:39:41,446][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:39:42,103][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:39:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:39:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:39:44,074][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:39:44,730][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:39:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:39:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:39:46,701][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:39:47,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:39:48,190][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:39:49,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:39:49,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:39:49,332][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:39:50,719][__main__][INFO] - Iteration 560 took 51s (9.15% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 7m 5s. Estimated total time: 14h 25m 32s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 46s. [2026-03-25 22:39:50,721][__main__][INFO] - Starting iteration 560. [2026-03-25 22:39:50,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:39:50,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:39:56,077][__main__][INFO] - Number of regex retries in iteration 560: 0 [2026-03-25 22:39:56,078][__main__][INFO] - agents played in iteration 560 are Alice, Bob [2026-03-25 22:39:56,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:56,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:56,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:39:56,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:39:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:39:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:39:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:39:59,410][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:40:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:40:00,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:40:01,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:40:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:40:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:40:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:40:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:40:04,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:40:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:40:05,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:40:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:40:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:40:07,952][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:40:08,610][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:40:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:40:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:40:10,580][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:40:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:40:11,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:40:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:40:13,208][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:40:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:40:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:40:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:40:15,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:40:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:40:17,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:40:17,805][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:40:18,462][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:40:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:40:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:40:20,434][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:40:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:40:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:40:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:40:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:40:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:40:24,392][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:40:25,048][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:40:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:40:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:40:27,019][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:40:27,678][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:40:28,335][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:40:29,316][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:40:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:40:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:40:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:40:31,945][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:40:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:40:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:40:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:40:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:40:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:40:35,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:40:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:40:37,201][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:40:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:40:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:40:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:40:39,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:40:40,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:40:41,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:40:41,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:40:41,762][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:40:43,124][__main__][INFO] - Iteration 561 took 52s (10.21% Gen, 87.18% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 14m 1s. Estimated total time: 14h 33m 20s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 40s. [2026-03-25 22:40:43,127][__main__][INFO] - Starting iteration 561. [2026-03-25 22:40:43,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:40:43,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:40:47,987][__main__][INFO] - Number of regex retries in iteration 561: 0 [2026-03-25 22:40:47,989][__main__][INFO] - agents played in iteration 561 are Alice, Bob [2026-03-25 22:40:48,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:40:48,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:40:48,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:40:48,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:40:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:40:49,989][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:40:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:40:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:40:51,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:40:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:40:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:40:53,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:40:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:40:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:40:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:40:56,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:40:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:40:57,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:40:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:40:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:40:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:41:00,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:41:01,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:41:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:41:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:41:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:41:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:41:04,440][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:41:05,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:41:05,753][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:41:06,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:41:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:41:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:41:08,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:41:09,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:41:09,695][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:41:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:41:11,008][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:41:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:41:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:41:12,979][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:41:13,636][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:41:14,293][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:41:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:41:15,606][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:41:16,263][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:41:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:41:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:41:18,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:41:18,890][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:41:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:41:20,203][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:41:21,114][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:41:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:41:22,428][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:41:23,084][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:41:23,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:41:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:41:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:41:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:41:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:41:27,033][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:41:27,690][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:41:28,347][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:41:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:41:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:41:30,319][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:41:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:41:31,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:41:32,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:41:33,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:41:33,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:41:33,627][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:41:34,978][__main__][INFO] - Iteration 562 took 51s (9.36% Gen, 88.02% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 3m 58s. Estimated total time: 14h 24m 9s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2026-03-25 22:41:34,982][__main__][INFO] - Starting iteration 562. [2026-03-25 22:41:34,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:41:34,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:41:39,843][__main__][INFO] - Number of regex retries in iteration 562: 0 [2026-03-25 22:41:39,845][__main__][INFO] - agents played in iteration 562 are Alice, Bob [2026-03-25 22:41:40,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:41:40,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:41:40,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:41:40,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:41:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:41:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:41:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:41:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:41:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:41:44,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:41:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:41:45,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:41:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:41:47,077][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:41:47,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:41:48,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:41:49,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:41:49,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:41:50,366][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:41:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:41:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:41:52,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:41:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:41:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:41:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:41:54,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:41:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:41:56,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:41:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:41:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:41:58,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:41:58,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:41:59,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:42:00,225][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:42:00,882][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:42:01,539][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:42:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:42:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:42:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:42:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:42:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:42:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:42:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:42:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:42:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:42:08,107][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:42:08,763][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:42:09,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:42:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:42:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:42:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:42:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:42:13,053][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:42:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:42:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:42:15,023][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:42:15,680][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:42:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:42:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:42:17,651][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:42:18,307][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:42:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:42:19,620][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:42:20,278][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:42:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:42:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:42:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:42:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:42:23,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:42:24,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:42:25,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:42:25,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:42:25,520][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:42:26,953][__main__][INFO] - Iteration 563 took 51s (9.35% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 5m 5s. Estimated total time: 14h 26m 8s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 4s. [2026-03-25 22:42:26,955][__main__][INFO] - Starting iteration 563. [2026-03-25 22:42:26,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:42:26,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:42:31,716][__main__][INFO] - Number of regex retries in iteration 563: 0 [2026-03-25 22:42:31,717][__main__][INFO] - agents played in iteration 563 are Alice, Bob [2026-03-25 22:42:32,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:42:32,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:42:32,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:42:32,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:42:32,966][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:42:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:42:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:42:34,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:42:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:42:36,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:42:36,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:42:37,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:42:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:42:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:42:39,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:42:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:42:40,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:42:41,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:42:42,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:42:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:42:43,430][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:42:44,087][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:42:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:42:45,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:42:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:42:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:42:47,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:42:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:42:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:42:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:42:50,003][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:42:50,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:42:51,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:42:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:42:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:42:53,287][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:42:53,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:42:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:42:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:42:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:42:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:42:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:42:57,888][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:42:58,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:42:59,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:42:59,860][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:43:00,517][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:43:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:43:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:43:02,487][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:43:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:43:03,801][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:43:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:43:05,452][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:43:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:43:06,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:43:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:43:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:43:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:43:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:43:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:43:10,711][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:43:11,368][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:43:12,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:43:12,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:43:13,339][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:43:13,996][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:43:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:43:15,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:43:16,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:43:17,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:43:17,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:43:17,325][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:43:19,254][__main__][INFO] - Iteration 564 took 52s (9.09% Gen, 87.21% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 9m 42s. Estimated total time: 14h 31m 37s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 48s. [2026-03-25 22:43:19,257][__main__][INFO] - Starting iteration 564. [2026-03-25 22:43:19,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:43:19,261][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:43:24,217][__main__][INFO] - Number of regex retries in iteration 564: 0 [2026-03-25 22:43:24,219][__main__][INFO] - agents played in iteration 564 are Alice, Bob [2026-03-25 22:43:24,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:43:24,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:43:24,841][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:43:24,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:43:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:43:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:43:26,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:43:27,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:43:28,094][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:43:28,750][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:43:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:43:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:43:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:43:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:43:32,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:43:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:43:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:43:34,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:43:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:43:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:43:35,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:43:36,644][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:43:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:43:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:43:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:43:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:43:39,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:43:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:43:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:43:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:43:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:43:43,227][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:43:43,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:43:44,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:43:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:43:45,868][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:43:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:43:47,187][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:43:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:43:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:43:49,161][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:43:49,820][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:43:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:43:51,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:43:51,805][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:43:52,464][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:43:53,126][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:43:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:43:54,446][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:43:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:43:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:43:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:43:57,443][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:43:58,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:43:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:43:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:44:00,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:44:00,739][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:44:01,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:44:02,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:44:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:44:03,374][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:44:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:44:04,688][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:44:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:44:06,003][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:44:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:44:07,319][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:44:07,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:44:08,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:44:10,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:44:10,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:44:10,117][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:44:11,550][__main__][INFO] - Iteration 565 took 52s (9.48% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 8m 43s. Estimated total time: 14h 31m 31s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 45s. [2026-03-25 22:44:11,552][__main__][INFO] - Starting iteration 565. [2026-03-25 22:44:11,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:44:11,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:44:17,276][__main__][INFO] - Number of regex retries in iteration 565: 0 [2026-03-25 22:44:17,277][__main__][INFO] - agents played in iteration 565 are Alice, Bob [2026-03-25 22:44:17,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:44:17,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:44:17,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:44:17,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:44:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:44:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:44:20,088][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:44:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:44:21,416][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:44:22,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:44:22,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:44:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:44:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:44:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:44:25,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:44:26,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:44:26,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:44:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:44:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:44:28,708][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:44:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:44:30,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:44:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:44:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:44:32,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:44:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:44:33,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:44:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:44:34,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:44:35,334][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:44:36,000][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:44:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:44:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:44:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:44:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:44:39,286][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:44:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:44:40,602][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:44:41,261][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:44:41,918][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:44:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:44:43,232][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:44:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:44:44,548][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:44:45,206][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:44:45,863][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:44:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:44:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:44:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:44:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:44:49,149][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:44:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:44:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:44:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:44:52,145][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:44:52,802][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:44:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:44:54,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:44:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:44:55,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:44:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:44:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:44:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:44:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:44:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:44:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:45:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:45:00,691][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:45:01,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:45:02,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:45:03,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:45:03,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:45:03,476][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:45:04,879][__main__][INFO] - Iteration 566 took 53s (10.73% Gen, 86.64% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 25m 3s. Estimated total time: 14h 48m 44s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 52s, 500 more iterations: 7h 24m 22s. [2026-03-25 22:45:04,882][__main__][INFO] - Starting iteration 566. [2026-03-25 22:45:04,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:45:04,886][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:45:09,982][__main__][INFO] - Number of regex retries in iteration 566: 0 [2026-03-25 22:45:09,984][__main__][INFO] - agents played in iteration 566 are Alice, Bob [2026-03-25 22:45:10,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:10,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:10,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:45:10,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:45:11,588][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:45:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:45:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:45:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:45:14,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:45:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:45:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:45:16,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:45:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:45:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:45:18,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:45:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:45:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:45:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:45:20,734][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:45:21,391][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:45:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:45:22,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:45:23,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:45:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:45:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:45:25,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:45:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:45:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:45:27,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:45:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:45:28,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:45:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:45:29,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:45:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:45:31,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:45:31,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:45:32,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:45:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:45:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:45:34,529][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:45:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:45:35,844][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:45:36,501][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:45:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:45:37,815][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:45:38,472][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:45:39,129][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:45:39,786][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:45:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:45:41,100][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:45:41,757][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:45:42,414][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:45:43,404][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:45:44,062][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:45:44,720][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:45:45,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:45:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:45:46,690][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:45:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:45:48,005][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:45:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:45:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:45:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:45:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:45:51,292][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:45:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:45:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:45:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:45:53,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:45:54,829][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:45:56,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:45:56,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:45:56,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:45:57,454][__main__][INFO] - Iteration 567 took 52s (9.70% Gen, 87.62% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 11m 37s. Estimated total time: 14h 36m 10s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 5s. [2026-03-25 22:45:57,457][__main__][INFO] - Starting iteration 567. [2026-03-25 22:45:57,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:45:57,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:46:02,998][__main__][INFO] - Number of regex retries in iteration 567: 0 [2026-03-25 22:46:02,999][__main__][INFO] - agents played in iteration 567 are Alice, Bob [2026-03-25 22:46:03,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:03,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:03,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:46:03,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:46:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:46:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:46:05,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:46:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:46:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:46:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:46:08,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:46:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:46:09,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:46:10,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:46:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:46:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:46:12,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:46:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:46:13,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:46:14,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:46:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:46:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:46:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:46:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:46:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:46:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:46:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:46:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:46:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:46:20,904][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:46:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:46:22,220][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:46:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:46:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:46:24,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:46:24,849][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:46:25,506][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:46:26,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:46:26,822][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:46:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:46:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:46:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:46:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:46:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:46:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:46:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:46:32,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:46:32,736][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:46:33,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:46:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:46:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:46:35,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:46:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:46:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:46:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:46:38,343][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:46:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:46:39,657][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:46:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:46:40,971][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:46:41,629][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:46:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:46:42,942][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:46:43,602][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:46:44,259][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:46:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:46:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:46:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:46:46,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:46:47,705][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:46:48,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:46:48,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:46:48,867][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:46:50,223][__main__][INFO] - Iteration 568 took 52s (10.49% Gen, 86.93% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 13m 57s. Estimated total time: 14h 39m 23s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 41s. [2026-03-25 22:46:50,226][__main__][INFO] - Starting iteration 568. [2026-03-25 22:46:50,230][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:46:50,231][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:46:55,173][__main__][INFO] - Number of regex retries in iteration 568: 0 [2026-03-25 22:46:55,174][__main__][INFO] - agents played in iteration 568 are Alice, Bob [2026-03-25 22:46:55,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:55,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:55,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:46:55,906][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:46:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:46:57,246][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:46:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:46:58,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:46:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:46:59,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:47:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:47:01,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:47:01,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:47:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:47:03,157][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:47:03,814][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:47:04,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:47:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:47:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:47:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:47:07,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:47:07,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:47:08,412][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:47:09,069][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:47:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:47:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:47:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:47:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:47:12,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:47:13,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:47:13,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:47:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:47:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:47:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:47:16,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:47:16,956][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:47:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:47:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:47:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:47:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:47:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:47:20,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:47:21,556][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:47:22,212][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:47:22,870][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:47:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:47:24,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:47:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:47:25,498][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:47:26,156][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:47:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:47:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:47:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:47:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:47:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:47:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:47:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:47:31,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:47:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:47:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:47:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:47:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:47:35,049][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:47:35,707][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:47:36,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:47:37,021][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:47:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:47:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:47:38,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:47:39,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:47:41,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:47:41,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:47:41,020][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:47:42,514][__main__][INFO] - Iteration 569 took 52s (9.45% Gen, 87.68% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 5m 7s. Estimated total time: 14h 31m 25s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 42s. [2026-03-25 22:47:42,516][__main__][INFO] - Starting iteration 569. [2026-03-25 22:47:42,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:47:42,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:47:47,510][__main__][INFO] - Number of regex retries in iteration 569: 0 [2026-03-25 22:47:47,512][__main__][INFO] - agents played in iteration 569 are Alice, Bob [2026-03-25 22:47:48,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:47:48,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:47:48,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:47:48,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:47:48,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:47:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:47:50,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:47:50,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:47:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:47:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:47:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:47:53,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:47:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:47:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:47:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:47:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:47:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:47:57,351][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:47:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:47:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:47:59,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:47:59,981][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:48:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:48:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:48:01,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:48:02,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:48:03,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:48:03,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:48:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:48:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:48:05,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:48:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:48:07,211][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:48:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:48:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:48:09,182][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:48:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:48:10,496][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:48:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:48:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:48:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:48:13,123][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:48:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:48:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:48:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:48:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:48:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:48:17,065][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:48:17,721][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:48:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:48:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:48:19,693][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:48:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:48:21,343][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:48:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:48:22,657][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:48:23,314][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:48:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:48:24,628][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:48:25,285][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:48:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:48:26,600][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:48:27,258][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:48:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:48:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:48:29,229][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:48:29,887][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:48:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:48:31,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:48:31,997][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:48:33,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:48:33,179][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:48:33,180][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:48:34,730][__main__][INFO] - Iteration 570 took 52s (9.56% Gen, 87.47% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 3m 0s. Estimated total time: 14h 30m 11s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 5s. [2026-03-25 22:48:34,732][__main__][INFO] - Starting iteration 570. [2026-03-25 22:48:34,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:48:34,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:48:39,637][__main__][INFO] - Number of regex retries in iteration 570: 0 [2026-03-25 22:48:39,639][__main__][INFO] - agents played in iteration 570 are Alice, Bob [2026-03-25 22:48:40,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:48:40,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:48:40,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:48:40,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:48:41,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:48:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:48:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:48:42,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:48:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:48:44,302][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:48:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:48:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:48:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:48:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:48:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:48:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:48:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:48:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:48:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:48:50,874][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:48:51,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:48:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:48:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:48:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:48:54,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:48:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:48:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:48:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:48:56,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:48:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:48:58,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:48:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:48:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:49:00,127][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:49:00,789][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:49:01,452][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:49:02,117][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:49:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:49:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:49:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:49:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:49:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:49:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:49:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:49:07,391][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:49:08,052][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:49:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:49:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:49:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:49:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:49:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:49:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:49:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:49:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:49:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:49:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:49:15,630][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:49:16,288][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:49:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:49:17,603][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:49:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:49:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:49:19,575][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:49:20,232][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:49:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:49:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:49:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:49:22,861][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:49:23,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:49:24,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:49:25,554][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:49:25,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:49:25,558][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:49:27,084][__main__][INFO] - Iteration 571 took 52s (9.36% Gen, 87.72% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 4m 26s. Estimated total time: 14h 32m 29s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 14s. [2026-03-25 22:49:27,088][__main__][INFO] - Starting iteration 571. [2026-03-25 22:49:27,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:49:27,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:49:31,910][__main__][INFO] - Number of regex retries in iteration 571: 0 [2026-03-25 22:49:31,921][__main__][INFO] - agents played in iteration 571 are Alice, Bob [2026-03-25 22:49:32,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:49:32,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:49:32,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:49:32,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:49:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:49:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:49:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:49:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:49:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:49:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:49:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:49:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:49:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:49:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:49:39,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:49:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:49:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:49:41,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:49:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:49:43,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:49:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:49:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:49:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:49:45,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:49:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:49:47,088][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:49:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:49:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:49:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:49:49,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:49:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:49:51,028][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:49:51,685][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:49:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:49:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:49:53,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:49:54,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:49:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:49:55,628][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:49:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:49:56,942][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:49:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:49:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:49:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:49:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:50:00,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:50:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:50:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:50:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:50:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:50:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:50:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:50:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:50:05,779][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:50:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:50:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:50:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:50:08,408][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:50:09,065][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:50:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:50:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:50:11,037][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:50:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:50:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:50:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:50:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:50:14,323][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:50:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:50:15,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:50:16,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:50:17,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:50:17,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:50:17,662][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:50:19,131][__main__][INFO] - Iteration 572 took 52s (9.28% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 58m 25s. Estimated total time: 14h 27m 20s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 40s. [2026-03-25 22:50:19,134][__main__][INFO] - Starting iteration 572. [2026-03-25 22:50:19,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:50:19,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:50:23,918][__main__][INFO] - Number of regex retries in iteration 572: 0 [2026-03-25 22:50:23,920][__main__][INFO] - agents played in iteration 572 are Alice, Bob [2026-03-25 22:50:24,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:50:24,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:50:24,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:50:24,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:50:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:50:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:50:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:50:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:50:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:50:28,462][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:50:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:50:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:50:30,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:50:31,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:50:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:50:32,403][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:50:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:50:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:50:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:50:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:50:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:50:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:50:37,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:50:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:50:38,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:50:38,972][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:50:39,630][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:50:40,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:50:40,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:50:41,601][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:50:42,258][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:50:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:50:43,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:50:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:50:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:50:45,543][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:50:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:50:46,856][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:50:47,513][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:50:48,170][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:50:48,827][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:50:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:50:50,141][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:50:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:50:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:50:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:50:52,768][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:50:53,425][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:50:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:50:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:50:55,396][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:50:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:50:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:50:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:50:58,329][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:50:58,985][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:50:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:51:00,299][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:51:00,956][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:51:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:51:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:51:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:51:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:51:04,240][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:51:04,897][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:51:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:51:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:51:06,868][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:51:07,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:51:08,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:51:09,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:51:09,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:51:09,464][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:51:10,898][__main__][INFO] - Iteration 573 took 51s (9.24% Gen, 87.99% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 52m 54s. Estimated total time: 14h 22m 41s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 20s. [2026-03-25 22:51:10,901][__main__][INFO] - Starting iteration 573. [2026-03-25 22:51:10,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:51:10,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:51:15,645][__main__][INFO] - Number of regex retries in iteration 573: 0 [2026-03-25 22:51:15,647][__main__][INFO] - agents played in iteration 573 are Alice, Bob [2026-03-25 22:51:16,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:51:16,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:51:16,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:51:16,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:51:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:51:17,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:51:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:51:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:51:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:51:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:51:20,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:51:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:51:22,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:51:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:51:23,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:51:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:51:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:51:25,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:51:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:51:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:51:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:51:28,147][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:51:28,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:51:29,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:51:30,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:51:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:51:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:51:32,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:51:32,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:51:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:51:34,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:51:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:51:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:51:36,030][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:51:36,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:51:37,345][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:51:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:51:38,659][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:51:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:51:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:51:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:51:41,286][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:51:41,943][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:51:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:51:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:51:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:51:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:51:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:51:45,885][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:51:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:51:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:51:47,856][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:51:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:51:49,476][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:51:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:51:50,791][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:51:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:51:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:51:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:51:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:51:54,076][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:51:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:51:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:51:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:51:56,705][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:51:57,362][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:51:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:51:58,676][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:51:59,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:52:00,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:52:01,250][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:52:01,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:52:01,255][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:52:02,638][__main__][INFO] - Iteration 574 took 51s (9.16% Gen, 88.16% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 51m 36s. Estimated total time: 14h 22m 15s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 13s, 500 more iterations: 7h 11m 7s. [2026-03-25 22:52:02,641][__main__][INFO] - Starting iteration 574. [2026-03-25 22:52:02,653][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:52:02,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:52:08,399][__main__][INFO] - Number of regex retries in iteration 574: 0 [2026-03-25 22:52:08,400][__main__][INFO] - agents played in iteration 574 are Alice, Bob [2026-03-25 22:52:08,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:09,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:09,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:52:09,005][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:52:09,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:52:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:52:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:52:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:52:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:52:12,922][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:52:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:52:14,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:52:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:52:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:52:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:52:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:52:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:52:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:52:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:52:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:52:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:52:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:52:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:52:22,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:52:22,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:52:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:52:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:52:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:52:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:52:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:52:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:52:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:52:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:52:28,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:52:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:52:29,998][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:52:30,655][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:52:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:52:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:52:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:52:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:52:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:52:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:52:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:52:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:52:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:52:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:52:37,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:52:38,538][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:52:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:52:39,852][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:52:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:52:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:52:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:52:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:52:43,438][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:52:44,095][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:52:44,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:52:45,408][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:52:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:52:46,721][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:52:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:52:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:52:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:52:49,349][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:52:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:52:50,663][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:52:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:52:51,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:52:52,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:52:54,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:52:54,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:52:54,055][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:52:55,532][__main__][INFO] - Iteration 575 took 52s (10.87% Gen, 86.33% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 9m 49s. Estimated total time: 14h 41m 20s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 40s. [2026-03-25 22:52:55,535][__main__][INFO] - Starting iteration 575. [2026-03-25 22:52:55,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:52:55,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:53:00,288][__main__][INFO] - Number of regex retries in iteration 575: 0 [2026-03-25 22:53:00,290][__main__][INFO] - agents played in iteration 575 are Alice, Bob [2026-03-25 22:53:00,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:00,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:00,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:53:00,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:53:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:53:02,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:53:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:53:03,591][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:53:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:53:04,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:53:05,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:53:06,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:53:06,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:53:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:53:08,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:53:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:53:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:53:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:53:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:53:11,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:53:12,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:53:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:53:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:53:14,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:53:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:53:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:53:16,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:53:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:53:17,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:53:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:53:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:53:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:53:20,013][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:53:20,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:53:21,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:53:21,985][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:53:22,642][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:53:23,299][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:53:23,956][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:53:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:53:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:53:25,928][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:53:26,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:53:27,243][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:53:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:53:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:53:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:53:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:53:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:53:31,185][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:53:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:53:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:53:33,418][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:53:34,075][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:53:34,731][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:53:35,388][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:53:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:53:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:53:37,358][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:53:38,015][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:53:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:53:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:53:39,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:53:40,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:53:41,299][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:53:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:53:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:53:43,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:53:43,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:53:44,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:53:45,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:53:45,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:53:45,986][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:53:47,512][__main__][INFO] - Iteration 576 took 51s (9.14% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 53m 51s. Estimated total time: 14h 26m 14s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 7s. [2026-03-25 22:53:47,515][__main__][INFO] - Starting iteration 576. [2026-03-25 22:53:47,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:53:47,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:53:52,353][__main__][INFO] - Number of regex retries in iteration 576: 0 [2026-03-25 22:53:52,354][__main__][INFO] - agents played in iteration 576 are Alice, Bob [2026-03-25 22:53:52,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:52,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:52,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:53:52,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:53:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:53:54,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:53:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:53:55,572][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:53:56,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:53:56,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:53:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:53:58,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:53:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:53:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:54:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:54:00,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:54:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:54:02,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:54:02,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:54:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:54:04,113][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:54:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:54:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:54:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:54:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:54:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:54:08,056][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:54:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:54:09,370][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:54:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:54:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:54:11,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:54:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:54:12,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:54:13,311][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:54:13,968][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:54:14,625][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:54:15,282][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:54:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:54:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:54:17,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:54:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:54:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:54:19,224][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:54:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:54:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:54:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:54:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:54:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:54:23,164][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:54:23,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:54:24,479][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:54:25,457][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:54:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:54:26,772][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:54:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:54:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:54:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:54:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:54:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:54:30,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:54:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:54:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:54:32,683][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:54:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:54:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:54:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:54:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:54:35,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:54:36,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:54:37,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:54:37,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:54:37,891][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:54:39,205][__main__][INFO] - Iteration 577 took 51s (9.35% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 48m 12s. Estimated total time: 14h 21m 28s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 8s, 500 more iterations: 7h 10m 44s. [2026-03-25 22:54:39,208][__main__][INFO] - Starting iteration 577. [2026-03-25 22:54:39,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:54:39,212][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:54:44,686][__main__][INFO] - Number of regex retries in iteration 577: 0 [2026-03-25 22:54:44,687][__main__][INFO] - agents played in iteration 577 are Alice, Bob [2026-03-25 22:54:45,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:54:45,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:54:45,367][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:54:45,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:54:46,062][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:54:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:54:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:54:47,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:54:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:54:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:54:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:54:50,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:54:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:54:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:54:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:54:53,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:54:53,900][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:54:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:54:55,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:54:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:54:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:54:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:54:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:54:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:54:59,158][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:54:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:55:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:55:01,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:55:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:55:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:55:03,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:55:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:55:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:55:05,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:55:05,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:55:06,389][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:55:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:55:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:55:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:55:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:55:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:55:10,335][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:55:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:55:11,649][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:55:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:55:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:55:13,620][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:55:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:55:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:55:15,592][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:55:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:55:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:55:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:55:18,469][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:55:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:55:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:55:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:55:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:55:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:55:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:55:23,071][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:55:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:55:24,386][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:55:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:55:25,702][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:55:26,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:55:27,019][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:55:27,676][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:55:28,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:55:29,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:55:30,254][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:55:30,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:55:32,066][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:55:33,648][__main__][INFO] - Iteration 578 took 54s (10.06% Gen, 87.03% Train). Generation: 5s, Training: 47s. Estimated remaining time: 6h 33m 8s. Estimated total time: 15h 7m 17s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 43s, 500 more iterations: 7h 33m 38s. [2026-03-25 22:55:33,651][__main__][INFO] - Starting iteration 578. [2026-03-25 22:55:33,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:55:33,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:55:38,371][__main__][INFO] - Number of regex retries in iteration 578: 0 [2026-03-25 22:55:38,372][__main__][INFO] - agents played in iteration 578 are Alice, Bob [2026-03-25 22:55:38,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:55:38,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:55:38,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:55:38,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:55:39,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:55:40,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:55:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:55:41,599][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:55:42,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:55:42,913][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:55:43,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:55:44,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:55:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:55:45,540][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:55:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:55:46,855][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:55:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:55:48,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:55:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:55:49,487][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:55:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:55:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:55:51,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:55:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:55:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:55:53,431][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:55:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:55:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:55:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:55:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:55:56,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:55:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:55:58,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:55:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:55:59,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:56:00,008][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:56:00,665][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:56:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:56:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:56:02,635][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:56:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:56:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:56:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:56:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:56:05,921][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:56:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:56:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:56:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:56:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:56:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:56:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:56:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:56:11,524][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:56:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:56:12,838][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:56:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:56:14,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:56:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:56:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:56:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:56:16,781][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:56:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:56:18,095][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:56:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:56:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:56:20,067][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:56:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:56:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:56:22,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:56:22,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:56:24,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:56:24,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:56:24,009][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:56:25,362][__main__][INFO] - Iteration 579 took 51s (9.12% Gen, 88.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 46m 45s. Estimated total time: 14h 21m 46s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 53s. [2026-03-25 22:56:25,364][__main__][INFO] - Starting iteration 579. [2026-03-25 22:56:25,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:56:25,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:56:30,221][__main__][INFO] - Number of regex retries in iteration 579: 0 [2026-03-25 22:56:30,222][__main__][INFO] - agents played in iteration 579 are Alice, Bob [2026-03-25 22:56:30,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:56:30,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:56:30,906][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:56:30,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:56:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:56:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:56:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:56:33,500][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:56:34,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:56:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:56:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:56:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:56:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:56:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:56:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:56:38,754][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:56:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:56:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:56:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:56:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:56:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:56:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:56:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:56:44,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:56:44,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:56:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:56:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:56:46,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:56:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:56:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:56:48,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:56:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:56:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:56:50,576][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:56:51,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:56:51,891][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:56:52,548][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:56:53,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:56:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:56:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:56:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:56:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:56:56,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:56:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:56:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:56:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:56:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:56:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:57:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:57:01,090][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:57:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:02,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:57:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:57:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:57:06,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:57:06,692][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:57:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:57:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:57:08,663][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:57:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:57:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:57:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:57:11,293][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:57:11,950][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:57:12,607][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:57:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:57:13,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:57:14,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:57:16,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:57:16,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:57:16,147][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:57:17,518][__main__][INFO] - Iteration 580 took 52s (9.31% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 53m 17s. Estimated total time: 14h 29m 11s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 35s. [2026-03-25 22:57:17,521][__main__][INFO] - Starting iteration 580. [2026-03-25 22:57:17,525][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:57:17,526][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:57:22,456][__main__][INFO] - Number of regex retries in iteration 580: 0 [2026-03-25 22:57:22,457][__main__][INFO] - agents played in iteration 580 are Alice, Bob [2026-03-25 22:57:22,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:57:23,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:57:23,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:57:23,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:57:23,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:57:24,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:57:24,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:57:25,630][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:57:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:57:26,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:57:27,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:57:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:57:28,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:57:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:57:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:57:30,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:57:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:57:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:57:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:57:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:57:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:57:34,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:57:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:57:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:57:36,800][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:57:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:57:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:57:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:57:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:57:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:57:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:57:41,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:57:42,054][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:57:42,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:57:43,368][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:57:44,025][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:57:44,682][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:57:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:57:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:57:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:57:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:57:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:57:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:57:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:57:49,938][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:57:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:57:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:57:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:57:52,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:57:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:57:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:55,544][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:57:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:57:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:57:58,172][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:57:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:57:59,485][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:00,142][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:00,799][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:58:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:58:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:58:02,769][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:58:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:58:04,083][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:58:04,740][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:58:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:58:06,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:58:06,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:58:07,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:58:07,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:58:07,890][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:58:09,284][__main__][INFO] - Iteration 581 took 51s (9.53% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 45m 56s. Estimated total time: 14h 22m 41s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 20s. [2026-03-25 22:58:09,287][__main__][INFO] - Starting iteration 581. [2026-03-25 22:58:09,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:58:09,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:58:14,448][__main__][INFO] - Number of regex retries in iteration 581: 0 [2026-03-25 22:58:14,450][__main__][INFO] - agents played in iteration 581 are Alice, Bob [2026-03-25 22:58:15,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:58:15,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:58:15,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:58:15,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:58:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:58:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:58:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:58:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:58:18,426][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:58:19,083][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:58:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:58:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:58:21,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:58:21,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:58:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:58:23,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:58:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:58:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:58:24,995][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:58:25,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:58:26,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:58:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:58:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:58:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:58:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:58:29,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:58:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:58:30,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:58:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:58:32,225][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:58:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:58:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:58:34,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:58:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:58:35,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:58:36,167][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:58:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:58:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:58:38,137][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:58:38,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:58:39,451][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:58:40,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:58:40,766][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:58:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:58:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:58:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:58:43,394][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:58:44,051][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:58:44,708][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:58:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:58:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:58:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:58:47,671][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:58:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:58:48,985][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:58:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:58:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:58:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:58:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:52,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:58:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:58:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:58:54,898][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:58:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:58:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:58:56,870][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:58:57,526][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:58:58,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:58:58,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:59:00,135][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:59:00,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:59:00,139][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:59:01,458][__main__][INFO] - Iteration 582 took 52s (9.89% Gen, 87.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 51m 51s. Estimated total time: 14h 29m 29s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 44s. [2026-03-25 22:59:01,461][__main__][INFO] - Starting iteration 582. [2026-03-25 22:59:01,465][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:59:01,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:59:06,569][__main__][INFO] - Number of regex retries in iteration 582: 0 [2026-03-25 22:59:06,570][__main__][INFO] - agents played in iteration 582 are Alice, Bob [2026-03-25 22:59:07,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:07,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:07,211][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:59:07,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:59:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:59:08,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:59:09,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:59:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:59:10,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:59:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:59:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:59:12,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:59:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:59:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:59:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:59:15,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:59:15,728][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:59:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:59:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:59:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:59:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:59:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:59:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:59:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:59:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:59:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:59:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:59:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:59:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:59:24,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:59:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:59:25,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:59:26,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:59:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:59:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:59:28,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:59:28,865][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:59:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:59:30,179][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:59:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:59:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:59:32,150][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:59:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:59:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:59:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:59:34,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:59:35,433][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:59:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:59:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:59:37,403][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:59:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:59:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:59:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:59:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:59:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:59:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:59:42,295][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:59:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:59:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:59:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:59:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:59:45,579][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:59:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:59:46,893][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:59:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:59:48,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:59:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:59:49,520][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:59:50,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:59:51,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:59:52,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:59:52,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:59:52,214][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:59:53,693][__main__][INFO] - Iteration 583 took 52s (9.77% Gen, 87.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 52m 0s. Estimated total time: 14h 30m 30s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 15s. [2026-03-25 22:59:53,696][__main__][INFO] - Starting iteration 583. [2026-03-25 22:59:53,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:59:53,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:59:59,144][__main__][INFO] - Number of regex retries in iteration 583: 0 [2026-03-25 22:59:59,145][__main__][INFO] - agents played in iteration 583 are Alice, Bob [2026-03-25 22:59:59,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:59,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:59,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:59:59,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:00:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:00:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:00:01,868][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:00:02,525][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:00:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:00:03,837][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:00:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:00:05,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:00:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:00:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:00:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:00:07,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:00:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:00:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:00:09,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:00:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:00:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:00:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:00:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:00:13,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:00:13,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:00:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:00:15,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:00:15,661][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:00:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:00:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:00:17,631][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:00:18,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:00:18,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:00:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:00:20,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:00:20,915][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:00:21,572][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:00:22,229][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:00:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:00:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:00:24,199][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:00:24,855][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:00:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:00:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:00:26,826][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:00:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:00:28,140][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:00:28,797][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:00:29,454][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:00:30,111][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:00:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:00:31,425][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:00:32,365][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:00:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:00:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:00:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:00:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:00:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:00:36,307][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:00:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:00:37,621][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:00:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:00:38,935][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:00:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:00:40,249][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:00:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:00:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:00:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:00:42,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:00:43,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:00:44,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:00:44,946][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:00:44,948][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:00:46,251][__main__][INFO] - Iteration 584 took 52s (10.30% Gen, 87.16% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 56m 30s. Estimated total time: 14h 35m 52s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 56s. [2026-03-25 23:00:46,254][__main__][INFO] - Starting iteration 584. [2026-03-25 23:00:46,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:00:46,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:00:51,270][__main__][INFO] - Number of regex retries in iteration 584: 0 [2026-03-25 23:00:51,271][__main__][INFO] - agents played in iteration 584 are Alice, Bob [2026-03-25 23:00:51,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:00:51,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:00:51,962][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:00:51,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:00:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:00:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:00:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:00:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:00:55,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:00:55,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:00:56,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:00:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:00:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:00:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:00:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:00:59,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:01:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:01:01,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:01:01,802][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:01:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:01:03,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:01:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:01:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:01:05,088][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:01:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:01:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:01:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:01:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:01:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:01:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:01:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:01:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:01:11,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:01:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:01:12,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:01:12,975][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:01:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:01:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:01:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:01:15,603][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:01:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:01:16,916][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:01:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:01:18,230][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:01:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:01:19,544][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:01:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:01:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:01:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:01:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:01:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:01:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:01:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:01:25,077][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:01:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:01:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:01:27,051][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:01:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:01:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:01:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:01:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:01:30,337][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:01:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:01:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:01:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:01:32,967][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:01:33,624][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:01:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:01:34,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:01:35,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:01:36,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:01:36,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:01:36,880][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:01:38,286][__main__][INFO] - Iteration 585 took 52s (9.63% Gen, 87.66% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 46m 55s. Estimated total time: 14h 27m 10s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 35s. [2026-03-25 23:01:38,289][__main__][INFO] - Starting iteration 585. [2026-03-25 23:01:38,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:01:38,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:01:43,540][__main__][INFO] - Number of regex retries in iteration 585: 0 [2026-03-25 23:01:43,541][__main__][INFO] - agents played in iteration 585 are Alice, Bob [2026-03-25 23:01:44,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:44,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:44,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:01:44,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:01:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:01:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:01:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:01:46,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:01:47,650][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:01:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:01:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:01:49,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:01:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:01:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:01:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:01:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:01:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:01:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:01:54,219][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:01:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:01:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:01:56,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:01:56,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:01:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:01:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:01:58,819][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:01:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:02:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:02:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:02:01,446][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:02:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:02:02,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:02:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:02:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:02:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:02:05,387][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:02:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:02:06,702][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:02:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:02:08,016][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:02:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:02:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:02:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:02:10,644][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:02:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:02:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:02:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:02:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:02:13,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:02:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:02:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:02:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:02:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:02:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:02:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:02:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:02:19,434][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:02:20,091][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:02:20,748][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:02:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:02:22,062][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:02:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:02:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:02:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:02:24,690][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:02:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:02:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:02:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:02:27,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:02:28,098][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:02:29,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:02:29,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:02:29,214][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:02:30,662][__main__][INFO] - Iteration 586 took 52s (10.02% Gen, 87.21% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 51m 43s. Estimated total time: 14h 32m 50s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 25s. [2026-03-25 23:02:30,664][__main__][INFO] - Starting iteration 586. [2026-03-25 23:02:30,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:02:30,670][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:02:35,459][__main__][INFO] - Number of regex retries in iteration 586: 0 [2026-03-25 23:02:35,461][__main__][INFO] - agents played in iteration 586 are Alice, Bob [2026-03-25 23:02:36,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:02:36,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:02:36,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:02:36,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:02:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:02:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:02:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:02:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:02:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:02:40,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:02:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:02:41,356][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:02:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:02:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:02:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:02:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:02:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:02:45,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:02:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:02:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:02:47,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:02:47,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:02:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:02:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:02:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:02:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:02:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:02:51,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:02:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:02:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:02:53,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:02:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:02:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:02:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:02:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:02:57,169][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:02:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:02:58,513][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:02:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:02:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:03:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:03:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:03:01,844][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:03:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:03:03,176][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:03:03,843][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:03:04,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:03:05,169][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:03:05,829][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:03:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:03:07,156][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:03:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:03:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:03:09,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:03:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:03:11,069][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:03:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:03:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:03:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:03:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:03:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:03:15,034][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:03:15,694][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:03:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:03:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:03:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:03:18,325][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:03:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:03:19,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:03:20,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:03:21,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:03:21,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:03:21,725][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:03:23,231][__main__][INFO] - Iteration 587 took 52s (9.11% Gen, 88.01% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 54m 5s. Estimated total time: 14h 36m 4s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 2s. [2026-03-25 23:03:23,236][__main__][INFO] - Starting iteration 587. [2026-03-25 23:03:23,240][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:03:23,241][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:03:28,753][__main__][INFO] - Number of regex retries in iteration 587: 0 [2026-03-25 23:03:28,756][__main__][INFO] - agents played in iteration 587 are Alice, Bob [2026-03-25 23:03:29,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:03:29,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:03:29,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:03:29,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:03:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:03:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:03:31,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:03:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:03:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:03:33,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:03:34,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:03:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:03:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:03:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:03:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:03:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:03:38,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:03:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:03:39,622][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:03:40,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:03:40,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:03:41,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:03:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:03:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:03:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:03:44,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:03:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:03:45,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:03:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:03:46,887][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:03:47,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:03:48,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:03:48,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:03:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:03:50,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:03:50,846][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:03:51,505][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:03:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:03:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:03:53,481][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:03:54,139][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:03:54,799][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:03:55,457][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:03:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:03:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:03:57,436][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:03:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:03:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:03:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:04:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:04:00,745][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:04:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:04:02,602][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:04:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:04:03,921][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:04:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:04:05,241][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:04:05,901][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:04:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:04:07,221][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:04:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:04:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:04:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:04:09,861][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:04:10,520][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:04:11,181][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:04:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:04:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:04:13,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:04:14,313][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:04:15,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:04:15,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:04:15,483][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:04:17,291][__main__][INFO] - Iteration 588 took 54s (10.20% Gen, 86.45% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 17m 58s. Estimated total time: 15h 0m 52s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 5s, 500 more iterations: 7h 30m 26s. [2026-03-25 23:04:17,293][__main__][INFO] - Starting iteration 588. [2026-03-25 23:04:17,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:04:17,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:04:22,526][__main__][INFO] - Number of regex retries in iteration 588: 0 [2026-03-25 23:04:22,527][__main__][INFO] - agents played in iteration 588 are Alice, Bob [2026-03-25 23:04:23,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:04:23,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:04:23,102][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:04:23,102][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:04:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:04:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:04:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:04:25,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:04:26,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:04:27,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:04:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:04:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:04:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:04:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:04:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:04:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:04:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:04:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:04:33,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:04:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:04:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:04:35,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:04:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:04:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:04:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:04:37,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:04:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:04:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:04:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:04:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:04:41,191][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:04:41,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:04:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:04:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:04:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:04:44,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:04:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:04:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:04:46,472][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:04:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:04:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:04:48,446][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:04:49,103][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:04:49,766][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:04:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:04:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:04:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:04:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:04:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:04:53,708][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:04:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:04:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:04:56,174][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:04:56,829][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:04:57,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:04:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:04:58,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:04:59,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:05:00,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:05:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:05:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:05:02,089][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:05:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:05:03,406][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:05:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:05:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:05:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:05:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:05:06,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:05:07,637][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:05:08,859][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:05:08,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:05:08,864][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:05:10,205][__main__][INFO] - Iteration 589 took 52s (9.88% Gen, 87.58% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 58m 3s. Estimated total time: 14h 41m 49s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 54s. [2026-03-25 23:05:10,208][__main__][INFO] - Starting iteration 589. [2026-03-25 23:05:10,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:05:10,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:05:15,628][__main__][INFO] - Number of regex retries in iteration 589: 0 [2026-03-25 23:05:15,630][__main__][INFO] - agents played in iteration 589 are Alice, Bob [2026-03-25 23:05:16,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:05:16,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:05:16,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:05:16,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:05:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:05:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:05:18,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:05:19,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:05:19,891][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:05:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:05:21,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:05:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:05:22,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:05:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:05:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:05:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:05:25,196][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:05:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:05:26,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:05:27,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:05:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:05:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:05:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:05:29,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:05:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:05:31,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:05:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:05:32,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:05:33,119][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:05:33,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:05:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:05:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:05:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:05:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:05:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:05:37,729][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:05:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:05:39,047][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:05:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:05:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:05:41,021][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:05:41,679][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:05:42,342][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:05:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:05:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:05:44,314][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:05:44,971][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:05:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:05:46,284][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:05:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:05:47,599][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:05:48,255][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:05:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:05:49,899][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:05:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:05:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:05:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:05:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:05:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:05:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:05:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:05:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:05:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:05:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:05:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:05:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:05:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:05:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:05:59,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:06:00,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:06:01,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:06:01,989][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:06:01,991][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:06:03,498][__main__][INFO] - Iteration 590 took 53s (10.16% Gen, 87.00% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 3m 27s. Estimated total time: 14h 48m 7s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 3s. [2026-03-25 23:06:03,501][__main__][INFO] - Starting iteration 590. [2026-03-25 23:06:03,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:06:03,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:06:08,528][__main__][INFO] - Number of regex retries in iteration 590: 0 [2026-03-25 23:06:08,530][__main__][INFO] - agents played in iteration 590 are Alice, Bob [2026-03-25 23:06:09,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:06:09,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:06:09,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:06:09,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:06:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:06:10,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:06:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:06:12,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:06:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:06:13,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:06:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:06:14,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:06:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:06:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:06:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:06:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:06:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:06:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:06:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:06:20,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:06:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:06:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:06:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:06:22,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:06:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:06:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:06:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:06:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:06:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:06:26,826][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:06:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:06:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:06:28,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:06:29,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:06:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:06:30,782][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:06:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:06:32,097][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:06:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:06:33,414][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:06:34,072][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:06:34,730][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:06:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:06:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:06:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:06:37,366][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:06:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:06:38,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:06:39,342][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:06:39,999][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:06:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:06:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:06:42,353][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:06:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:06:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:06:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:06:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:06:45,638][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:06:46,297][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:06:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:06:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:06:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:06:48,931][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:06:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:06:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:06:50,905][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:06:51,564][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:06:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:06:52,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:06:53,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:06:55,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:06:55,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:06:55,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:06:56,364][__main__][INFO] - Iteration 591 took 52s (9.50% Gen, 88.00% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 55m 28s. Estimated total time: 14h 41m 1s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 6s, 500 more iterations: 7h 20m 30s. [2026-03-25 23:06:56,367][__main__][INFO] - Starting iteration 591. [2026-03-25 23:06:56,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:06:56,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:07:01,776][__main__][INFO] - Number of regex retries in iteration 591: 0 [2026-03-25 23:07:01,777][__main__][INFO] - agents played in iteration 591 are Alice, Bob [2026-03-25 23:07:02,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:02,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:02,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:07:02,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:07:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:07:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:07:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:07:05,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:07:06,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:07:06,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:07:07,376][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:07:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:07:08,705][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:07:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:07:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:07:10,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:07:11,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:07:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:07:12,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:07:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:07:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:07:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:07:15,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:07:16,016][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:07:16,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:07:17,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:07:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:07:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:07:19,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:07:20,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:07:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:07:21,351][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:07:22,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:07:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:07:23,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:07:24,001][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:07:24,665][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:07:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:07:25,992][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:07:26,654][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:07:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:07:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:07:28,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:07:29,294][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:07:29,953][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:07:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:07:31,268][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:07:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:07:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:07:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:07:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:07:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:07:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:07:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:07:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:07:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:07:38,293][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:07:38,951][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:07:39,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:07:40,268][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:07:40,927][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:07:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:07:42,243][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:07:42,900][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:07:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:07:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:07:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:07:45,531][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:07:46,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:07:47,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:07:48,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:07:48,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:07:48,277][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:07:49,696][__main__][INFO] - Iteration 592 took 53s (10.13% Gen, 87.20% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 2m 20s. Estimated total time: 14h 48m 46s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 52s, 500 more iterations: 7h 24m 23s. [2026-03-25 23:07:49,699][__main__][INFO] - Starting iteration 592. [2026-03-25 23:07:49,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:07:49,703][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:07:54,926][__main__][INFO] - Number of regex retries in iteration 592: 0 [2026-03-25 23:07:54,929][__main__][INFO] - agents played in iteration 592 are Alice, Bob [2026-03-25 23:07:55,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:55,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:55,637][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:07:55,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:07:56,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:07:57,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:07:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:07:58,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:07:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:07:59,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:08:00,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:08:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:08:01,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:08:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:08:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:08:03,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:08:04,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:08:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:08:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:08:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:08:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:08:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:08:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:08:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:08:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:08:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:08:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:08:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:08:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:08:13,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:08:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:08:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:08:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:08:15,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:08:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:08:17,136][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:08:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:08:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:08:19,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:08:19,763][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:08:20,419][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:08:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:08:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:08:22,389][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:08:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:08:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:08:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:08:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:08:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:08:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:08:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:08:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:08:28,718][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:08:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:08:30,032][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:08:30,689][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:08:31,346][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:08:32,003][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:08:32,660][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:08:33,316][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:08:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:08:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:08:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:08:35,943][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:08:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:08:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:08:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:08:38,573][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:08:39,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:08:40,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:08:41,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:08:41,234][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:08:41,235][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:08:42,529][__main__][INFO] - Iteration 593 took 52s (9.89% Gen, 87.65% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 53m 10s. Estimated total time: 14h 40m 28s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 14s. [2026-03-25 23:08:42,532][__main__][INFO] - Starting iteration 593. [2026-03-25 23:08:42,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:08:42,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:08:47,506][__main__][INFO] - Number of regex retries in iteration 593: 0 [2026-03-25 23:08:47,508][__main__][INFO] - agents played in iteration 593 are Alice, Bob [2026-03-25 23:08:48,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:08:48,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:08:48,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:08:48,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:08:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:08:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:08:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:08:50,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:08:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:08:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:08:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:08:53,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:08:53,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:08:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:08:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:08:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:08:56,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:08:57,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:08:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:08:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:08:59,258][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:08:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:09:00,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:09:01,229][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:09:01,887][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:09:02,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:09:03,201][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:09:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:09:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:09:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:09:05,829][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:09:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:09:07,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:09:07,802][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:09:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:09:09,117][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:09:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:09:10,430][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:09:11,088][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:09:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:09:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:09:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:09:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:09:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:09:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:09:15,687][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:09:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:09:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:09:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:09:18,316][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:09:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:09:19,631][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:09:20,623][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:09:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:09:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:09:22,593][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:09:23,250][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:09:23,907][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:09:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:09:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:09:25,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:09:26,536][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:09:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:09:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:09:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:09:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:09:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:09:30,478][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:09:31,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:09:31,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:09:33,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:09:33,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:09:33,019][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:09:34,450][__main__][INFO] - Iteration 594 took 51s (9.58% Gen, 87.66% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 37m 5s. Estimated total time: 14h 25m 16s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 38s. [2026-03-25 23:09:34,453][__main__][INFO] - Starting iteration 594. [2026-03-25 23:09:34,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:09:34,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:09:42,619][__main__][INFO] - Number of regex retries in iteration 594: 0 [2026-03-25 23:09:42,620][__main__][INFO] - agents played in iteration 594 are Alice, Bob [2026-03-25 23:09:43,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:43,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:43,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:09:43,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:09:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:09:44,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:09:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:09:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:09:46,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:09:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:09:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:09:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:09:49,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:09:49,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:09:50,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:09:51,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:09:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:09:52,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:09:53,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:09:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:09:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:09:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:09:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:09:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:09:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:09:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:09:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:09:59,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:09:59,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:10:00,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:10:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:10:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:10:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:10:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:10:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:10:04,302][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:10:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:10:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:10:06,271][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:10:06,927][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:10:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:10:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:10:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:10:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:10:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:10:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:10:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:10:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:10:12,838][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:10:13,494][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:10:14,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:10:14,807][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:10:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:10:16,453][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:10:17,110][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:10:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:10:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:10:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:10:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:10:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:10:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:10:21,709][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:10:22,366][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:10:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:10:23,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:10:24,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:10:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:10:25,653][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:10:26,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:10:27,118][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:10:28,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:10:28,345][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:10:28,347][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:10:29,841][__main__][INFO] - Iteration 595 took 55s (14.74% Gen, 82.56% Train). Generation: 8s, Training: 45s. Estimated remaining time: 6h 33m 59s. Estimated total time: 15h 23m 5s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 32s. [2026-03-25 23:10:29,844][__main__][INFO] - Starting iteration 595. [2026-03-25 23:10:29,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:10:29,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:10:34,624][__main__][INFO] - Number of regex retries in iteration 595: 0 [2026-03-25 23:10:34,626][__main__][INFO] - agents played in iteration 595 are Alice, Bob [2026-03-25 23:10:35,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:10:35,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:10:35,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:10:35,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:10:35,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:10:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:10:37,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:10:37,811][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:10:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:10:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:10:39,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:10:40,436][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:10:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:10:41,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:10:42,406][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:10:43,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:10:43,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:10:44,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:10:45,032][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:10:45,689][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:10:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:10:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:10:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:10:48,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:10:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:10:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:10:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:10:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:10:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:10:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:10:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:10:53,569][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:10:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:10:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:10:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:10:56,198][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:10:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:10:57,511][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:10:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:10:58,824][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:10:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:11:00,137][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:11:00,794][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:11:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:11:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:11:02,764][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:11:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:11:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:11:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:11:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:11:06,046][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:11:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:11:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:11:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:11:09,018][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:11:09,675][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:11:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:11:10,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:11:11,644][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:11:12,301][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:11:12,957][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:11:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:11:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:11:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:11:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:11:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:11:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:11:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:11:18,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:11:18,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:11:20,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:11:20,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:11:20,054][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:11:21,395][__main__][INFO] - Iteration 596 took 51s (9.27% Gen, 88.12% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 29m 12s. Estimated total time: 14h 19m 9s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 54s, 500 more iterations: 7h 9m 34s. [2026-03-25 23:11:21,398][__main__][INFO] - Starting iteration 596. [2026-03-25 23:11:21,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:11:21,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:11:26,521][__main__][INFO] - Number of regex retries in iteration 596: 0 [2026-03-25 23:11:26,522][__main__][INFO] - agents played in iteration 596 are Alice, Bob [2026-03-25 23:11:27,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:11:27,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:11:27,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:11:27,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:11:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:11:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:11:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:11:29,957][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:11:30,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:11:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:11:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:11:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:11:33,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:11:33,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:11:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:11:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:11:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:11:36,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:11:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:11:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:11:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:11:39,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:11:39,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:11:40,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:11:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:11:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:11:42,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:11:43,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:11:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:11:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:11:45,054][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:11:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:11:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:11:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:11:47,680][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:11:48,337][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:11:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:11:49,651][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:11:50,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:11:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:11:51,620][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:11:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:11:52,934][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:11:53,590][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:11:54,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:11:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:11:55,561][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:11:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:11:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:11:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:11:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:11:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:11:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:12:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:12:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:12:01,808][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:12:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:12:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:12:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:12:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:12:05,091][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:12:05,748][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:12:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:12:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:12:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:12:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:12:09,032][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:12:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:12:10,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:12:11,117][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:12:12,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:12:12,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:12:12,222][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:12:13,649][__main__][INFO] - Iteration 597 took 52s (9.80% Gen, 87.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 39m 59s. Estimated total time: 14h 30m 49s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 24s. [2026-03-25 23:12:13,652][__main__][INFO] - Starting iteration 597. [2026-03-25 23:12:13,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:12:13,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:12:18,671][__main__][INFO] - Number of regex retries in iteration 597: 0 [2026-03-25 23:12:18,673][__main__][INFO] - agents played in iteration 597 are Alice, Bob [2026-03-25 23:12:19,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:12:19,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:12:19,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:12:19,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:12:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:12:20,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:12:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:12:21,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:12:22,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:12:23,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:12:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:12:24,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:12:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:12:25,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:12:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:12:27,191][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:12:27,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:12:28,504][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:12:29,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:12:29,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:12:30,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:12:31,131][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:12:31,786][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:12:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:12:33,099][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:12:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:12:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:12:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:12:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:12:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:12:37,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:12:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:12:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:12:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:12:39,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:12:40,319][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:12:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:12:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:12:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:12:42,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:12:43,602][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:12:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:12:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:12:45,571][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:12:46,228][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:12:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:12:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:12:48,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:12:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:12:49,512][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:12:50,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:12:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:12:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:12:52,428][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:12:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:12:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:12:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:12:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:12:55,711][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:12:56,367][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:12:57,024][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:12:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:12:58,338][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:12:58,994][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:12:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:13:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:13:00,964][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:13:01,621][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:13:02,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:13:03,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:13:04,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:13:04,102][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:13:04,104][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:13:05,509][__main__][INFO] - Iteration 598 took 51s (9.67% Gen, 87.61% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 32m 32s. Estimated total time: 14h 24m 14s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 7s. [2026-03-25 23:13:05,511][__main__][INFO] - Starting iteration 598. [2026-03-25 23:13:05,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:13:05,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:13:10,568][__main__][INFO] - Number of regex retries in iteration 598: 0 [2026-03-25 23:13:10,570][__main__][INFO] - agents played in iteration 598 are Alice, Bob [2026-03-25 23:13:11,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:13:11,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:13:11,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:13:11,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:13:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:13:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:13:13,311][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:13:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:13:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:13:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:13:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:13:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:13:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:13:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:13:18,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:13:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:13:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:13:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:13:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:13:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:13:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:13:23,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:13:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:13:24,482][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:13:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:13:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:13:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:13:27,111][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:13:27,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:13:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:13:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:13:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:13:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:13:31,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:13:31,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:13:32,366][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:13:33,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:13:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:13:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:13:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:13:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:13:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:13:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:13:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:13:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:13:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:13:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:13:40,247][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:13:40,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:13:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:13:42,219][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:13:42,876][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:13:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:13:44,537][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:13:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:13:45,850][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:13:46,507][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:13:47,164][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:13:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:13:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:13:49,168][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:13:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:13:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:13:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:13:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:13:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:13:53,150][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:13:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:13:54,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:13:55,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:13:56,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:13:56,430][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:13:56,432][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:13:57,818][__main__][INFO] - Iteration 599 took 52s (9.66% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 39m 10s. Estimated total time: 14h 31m 44s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 52s. [2026-03-25 23:13:57,821][__main__][INFO] - Starting iteration 599. [2026-03-25 23:13:57,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:13:57,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:14:02,964][__main__][INFO] - Number of regex retries in iteration 599: 0 [2026-03-25 23:14:02,966][__main__][INFO] - agents played in iteration 599 are Alice, Bob [2026-03-25 23:14:03,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:03,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:03,544][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:14:03,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:14:04,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:14:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:14:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:14:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:14:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:14:07,514][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:14:08,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:14:08,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:14:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:14:10,140][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:14:10,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:14:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:14:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:14:12,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:14:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:14:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:14:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:14:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:14:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:14:16,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:14:17,364][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:14:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:14:18,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:14:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:14:19,991][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:14:20,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:14:21,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:14:21,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:14:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:14:23,276][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:14:23,933][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:14:24,590][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:14:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:14:25,903][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:14:26,561][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:14:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:14:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:14:28,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:14:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:14:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:14:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:14:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:14:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:14:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:14:33,127][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:14:33,783][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:14:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:14:35,097][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:14:36,043][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:14:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:14:37,360][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:14:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:14:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:14:39,331][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:14:39,987][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:14:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:14:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:14:41,957][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:14:42,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:14:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:14:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:14:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:14:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:14:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:14:46,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:14:47,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:14:48,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:14:48,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:14:48,440][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:14:49,796][__main__][INFO] - Iteration 600 took 51s (9.89% Gen, 87.49% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 32m 47s. Estimated total time: 14h 26m 13s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 37s, 500 more iterations: 7h 13m 6s. [2026-03-25 23:14:49,799][__main__][INFO] - Starting iteration 600. [2026-03-25 23:14:49,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:14:49,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:14:54,545][__main__][INFO] - Number of regex retries in iteration 600: 0 [2026-03-25 23:14:54,546][__main__][INFO] - agents played in iteration 600 are Alice, Bob [2026-03-25 23:14:55,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:55,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:55,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:14:55,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:14:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:14:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:14:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:14:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:14:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:14:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:14:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:15:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:15:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:15:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:15:02,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:15:03,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:15:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:15:04,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:15:05,035][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:15:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:15:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:15:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:15:07,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:15:08,318][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:15:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:15:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:15:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:15:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:15:11,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:15:12,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:15:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:15:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:15:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:15:14,884][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:15:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:15:16,197][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:15:16,854][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:15:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:15:18,167][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:15:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:15:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:15:20,137][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:15:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:15:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:15:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:15:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:15:23,421][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:15:24,077][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:15:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:15:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:15:26,048][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:15:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:15:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:15:28,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:15:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:15:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:15:30,323][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:15:30,979][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:15:31,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:15:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:15:32,949][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:15:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:15:34,263][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:15:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:15:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:15:36,233][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:15:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:15:37,547][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:15:38,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:15:39,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:15:40,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:15:40,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:15:40,286][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:15:42,831][__main__][INFO] - Iteration 601 took 53s (8.94% Gen, 86.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 49m 30s. Estimated total time: 14h 43m 49s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 22s, 500 more iterations: 7h 21m 54s. [2026-03-25 23:15:42,833][__main__][INFO] - Starting iteration 601. [2026-03-25 23:15:42,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:15:42,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:15:47,569][__main__][INFO] - Number of regex retries in iteration 601: 0 [2026-03-25 23:15:47,571][__main__][INFO] - agents played in iteration 601 are Alice, Bob [2026-03-25 23:15:48,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:15:48,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:15:48,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:15:48,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:15:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:15:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:15:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:15:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:15:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:15:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:15:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:15:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:15:54,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:15:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:15:55,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:15:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:15:56,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:15:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:15:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:15:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:15:59,329][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:15:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:16:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:16:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:16:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:16:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:16:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:16:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:16:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:16:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:16:05,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:16:06,552][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:16:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:16:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:16:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:16:09,179][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:16:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:16:10,492][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:16:11,149][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:16:11,806][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:16:12,462][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:16:13,119][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:16:13,776][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:16:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:16:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:16:15,745][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:16:16,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:16:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:16:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:16:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:16:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:16:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:16:20,646][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:16:21,303][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:16:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:16:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:16:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:16:23,930][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:16:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:16:25,244][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:16:25,901][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:16:26,558][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:16:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:16:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:16:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:16:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:16:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:16:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:16:31,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:16:31,870][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:16:33,080][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:16:33,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:16:33,085][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:16:34,336][__main__][INFO] - Iteration 602 took 51s (9.19% Gen, 88.38% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 23m 10s. Estimated total time: 14h 18m 21s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 50s, 500 more iterations: 7h 9m 10s. [2026-03-25 23:16:34,339][__main__][INFO] - Starting iteration 602. [2026-03-25 23:16:34,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:16:34,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:16:39,205][__main__][INFO] - Number of regex retries in iteration 602: 0 [2026-03-25 23:16:39,207][__main__][INFO] - agents played in iteration 602 are Alice, Bob [2026-03-25 23:16:39,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:16:39,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:16:39,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:16:39,860][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:16:40,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:16:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:16:41,803][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:16:42,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:16:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:16:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:16:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:16:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:16:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:16:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:16:47,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:16:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:16:48,368][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:16:49,024][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:16:49,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:16:50,338][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:16:50,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:16:51,651][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:16:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:16:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:16:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:16:54,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:16:54,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:16:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:16:56,247][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:16:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:16:57,559][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:16:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:16:58,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:16:59,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:17:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:17:00,842][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:17:01,499][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:17:02,156][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:17:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:17:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:17:04,126][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:17:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:17:05,439][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:17:06,096][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:17:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:17:07,409][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:17:08,066][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:17:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:17:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:17:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:17:10,692][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:17:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:17:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:17:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:17:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:17:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:17:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:17:15,538][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:17:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:17:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:17:17,508][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:17:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:17:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:17:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:17:20,135][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:17:20,791][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:17:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:17:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:17:22,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:17:23,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:17:24,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:17:24,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:17:24,725][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:17:26,099][__main__][INFO] - Iteration 603 took 51s (9.39% Gen, 87.94% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 26m 35s. Estimated total time: 14h 22m 37s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 15s, 500 more iterations: 7h 11m 18s. [2026-03-25 23:17:26,101][__main__][INFO] - Starting iteration 603. [2026-03-25 23:17:26,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:17:26,107][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:17:30,776][__main__][INFO] - Number of regex retries in iteration 603: 0 [2026-03-25 23:17:30,778][__main__][INFO] - agents played in iteration 603 are Alice, Bob [2026-03-25 23:17:31,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:31,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:31,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:17:31,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:17:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:17:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:17:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:17:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:17:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:17:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:17:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:17:36,627][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:17:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:17:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:17:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:17:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:17:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:17:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:17:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:17:41,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:17:42,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:17:43,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:17:43,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:17:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:17:45,166][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:17:45,823][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:17:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:17:47,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:17:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:17:48,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:17:49,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:17:49,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:17:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:17:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:17:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:17:52,389][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:17:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:17:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:17:54,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:17:55,015][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:17:55,672][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:17:56,329][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:17:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:17:57,642][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:17:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:17:58,956][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:17:59,612][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:18:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:18:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:18:01,582][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:18:02,238][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:18:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:18:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:18:04,524][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:18:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:18:05,837][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:18:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:18:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:18:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:18:08,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:18:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:18:09,781][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:18:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:18:11,095][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:18:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:18:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:18:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:18:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:18:14,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:18:15,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 23:18:16,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-25 23:18:16,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-25 23:18:16,342][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 23:18:17,624][__main__][INFO] - Iteration 604 took 51s (9.07% Gen, 88.44% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 21m 46s. Estimated total time: 14h 18m 40s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 52s, 500 more iterations: 7h 9m 20s.
[2026-03-25 23:18:17,627][__main__][INFO] - Starting iteration 604.
[2026-03-25 23:18:17,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2026-03-25 23:18:17,631][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:18:22,398][__main__][INFO] - Number of regex retries in iteration 604: 0 [2026-03-25 23:18:22,399][__main__][INFO] - agents played in iteration 604 are Alice, Bob [2026-03-25 23:18:23,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:18:23,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:18:23,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:18:23,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:18:23,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:18:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:18:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:18:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:18:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:18:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:18:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:18:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:18:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:18:29,806][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:18:30,463][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:18:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:18:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:18:32,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:18:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:18:33,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:18:34,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:18:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:18:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:18:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:18:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:18:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:18:38,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:18:39,003][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:18:39,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:18:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:18:40,974][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:18:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:18:42,288][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:18:42,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:18:43,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:18:44,259][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:18:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:18:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:18:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:18:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:18:47,543][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:18:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:18:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:18:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:18:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:18:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:18:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:18:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:18:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:18:53,455][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:18:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:18:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:18:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:18:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:18:56,989][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:18:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:18:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:18:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:18:59,616][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:19:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:19:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:19:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:19:02,243][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:19:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:19:03,557][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:19:04,213][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:19:04,870][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:19:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:19:06,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:19:06,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42
[2026-03-25 23:19:07,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-25 23:19:07,994][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-25 23:19:07,995][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 23:19:09,343][__main__][INFO] - Iteration 605 took 51s (9.22% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 24m 9s. Estimated total time: 14h 21m 55s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 57s.
[2026-03-25 23:19:09,346][__main__][INFO] - Starting iteration 605.
[2026-03-25 23:19:09,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2026-03-25 23:19:09,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:19:14,193][__main__][INFO] - Number of regex retries in iteration 605: 0 [2026-03-25 23:19:14,194][__main__][INFO] - agents played in iteration 605 are Alice, Bob [2026-03-25 23:19:14,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:19:14,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:19:14,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:19:14,906][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:19:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:19:16,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:19:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:19:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:19:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:19:18,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:19:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:19:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:19:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:19:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:19:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:19:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:19:23,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:19:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:19:24,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:19:25,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:19:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:19:26,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:19:27,408][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:19:28,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:19:28,721][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:19:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:19:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:19:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:19:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:19:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:19:32,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:19:33,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:19:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:19:34,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:19:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:19:35,942][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:19:36,599][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:19:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:19:37,913][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:19:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:19:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:19:39,882][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:19:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:19:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:19:41,852][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:19:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:19:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:19:43,822][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:19:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:19:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:19:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:19:46,448][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:19:47,383][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:19:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:19:48,698][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:19:49,354][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:19:50,013][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:19:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:19:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:19:51,982][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:19:52,639][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:19:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:19:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:19:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:19:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:19:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:19:56,582][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:19:57,238][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:19:57,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:19:58,626][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 23:19:59,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-25 23:19:59,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-25 23:19:59,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 23:20:01,075][__main__][INFO] - Iteration 606 took 51s (9.36% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 23m 30s. Estimated total time: 14h 22m 7s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 12s, 500 more iterations: 7h 11m 3s.
[2026-03-25 23:20:01,078][__main__][INFO] - Starting iteration 606.
[2026-03-25 23:20:01,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2026-03-25 23:20:01,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:20:06,212][__main__][INFO] - Number of regex retries in iteration 606: 0 [2026-03-25 23:20:06,213][__main__][INFO] - agents played in iteration 606 are Alice, Bob [2026-03-25 23:20:06,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:06,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:06,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:20:06,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:20:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:20:08,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:20:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:20:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:20:10,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:20:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:20:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:20:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:20:12,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:20:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:20:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:20:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:20:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:20:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:20:16,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:20:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:20:18,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:20:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:20:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:20:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:20:20,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:20:21,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:20:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:20:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:20:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:20:24,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:20:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:20:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:20:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:20:26,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:20:27,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:20:28,010][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:20:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:20:29,323][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:20:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:20:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:20:31,292][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:20:31,949][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:20:32,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:20:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:20:33,919][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:20:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:20:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:20:35,889][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:20:36,546][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:20:37,202][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:20:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:20:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:20:39,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:20:40,081][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:20:40,738][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:20:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:20:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:20:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:20:43,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:20:44,023][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:20:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:20:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:20:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:20:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:20:47,307][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:20:47,965][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:20:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:20:49,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:20:49,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:20:50,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42
[2026-03-25 23:20:51,815][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-25 23:20:51,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-25 23:20:51,823][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 23:20:53,340][__main__][INFO] - Iteration 607 took 52s (9.82% Gen, 87.28% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 31m 30s. Estimated total time: 14h 30m 59s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 29s.
[2026-03-25 23:20:53,349][__main__][INFO] - Starting iteration 607.
[2026-03-25 23:20:53,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2026-03-25 23:20:53,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:20:58,273][__main__][INFO] - Number of regex retries in iteration 607: 0 [2026-03-25 23:20:58,274][__main__][INFO] - agents played in iteration 607 are Alice, Bob [2026-03-25 23:20:58,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:58,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:58,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:20:58,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:20:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:21:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:21:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:21:01,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:21:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:21:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:21:03,467][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:21:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:21:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:21:05,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:21:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:21:06,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:21:07,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:21:08,066][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:21:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:21:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:21:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:21:10,692][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:21:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:21:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:21:12,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:21:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:21:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:21:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:21:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:21:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:21:16,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:21:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:21:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:21:18,572][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:21:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:21:19,886][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:21:20,543][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:21:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:21:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:21:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:21:23,170][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:21:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:21:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:21:25,140][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:21:25,798][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:21:26,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:21:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:21:27,771][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:21:28,427][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:21:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:21:29,741][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:21:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:21:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:21:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:21:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:21:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:21:34,001][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:21:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:21:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:21:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:21:36,636][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:21:37,294][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:21:37,952][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:21:38,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:21:39,269][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:21:39,927][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:21:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:21:41,246][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:21:41,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:21:42,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:21:43,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:21:43,954][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:21:43,956][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:21:45,409][__main__][INFO] - Iteration 608 took 52s (9.43% Gen, 87.77% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 27m 5s. Estimated total time: 14h 27m 27s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-25 23:21:45,412][__main__][INFO] - Starting iteration 608. [2026-03-25 23:21:45,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:21:45,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:21:52,199][__main__][INFO] - Number of regex retries in iteration 608: 0 [2026-03-25 23:21:52,200][__main__][INFO] - agents played in iteration 608 are Alice, Bob [2026-03-25 23:21:52,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:21:52,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:21:52,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:21:52,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:21:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:21:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:21:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:21:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:21:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:21:56,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:21:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:21:58,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:21:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:21:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:22:00,156][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:22:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:22:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:22:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:22:02,789][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:22:03,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:22:04,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:22:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:22:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:22:06,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:22:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:22:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:22:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:22:08,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:22:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:22:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:22:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:22:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:22:11,993][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:22:12,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:22:13,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:22:13,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:22:14,621][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:22:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:22:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:22:16,594][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:22:17,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:22:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:22:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:22:19,223][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:22:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:22:20,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:22:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:22:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:22:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:22:23,164][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:22:23,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:22:24,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:22:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:22:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:22:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:22:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:22:28,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:22:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:22:29,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:22:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:22:30,725][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:22:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:22:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:22:32,695][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:22:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:22:34,009][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:22:34,665][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:22:35,322][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:22:35,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:22:36,771][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:22:37,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:22:37,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:22:37,914][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:22:39,442][__main__][INFO] - Iteration 609 took 54s (12.56% Gen, 84.61% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 59m 12s. Estimated total time: 15h 0m 27s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 2s, 500 more iterations: 7h 30m 13s. [2026-03-25 23:22:39,445][__main__][INFO] - Starting iteration 609. [2026-03-25 23:22:39,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:22:39,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:22:44,499][__main__][INFO] - Number of regex retries in iteration 609: 0 [2026-03-25 23:22:44,501][__main__][INFO] - agents played in iteration 609 are Alice, Bob [2026-03-25 23:22:45,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:22:45,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:22:45,569][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:22:45,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:22:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:22:46,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:22:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:22:48,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:22:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:22:49,509][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:22:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:22:50,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:22:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:22:52,135][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:22:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:22:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:22:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:22:54,764][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:22:55,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:22:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:22:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:22:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:22:58,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:22:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:22:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:23:00,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:23:00,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:23:01,332][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:23:01,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:23:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:23:03,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:23:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:23:04,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:23:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:23:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:23:06,588][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:23:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:23:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:23:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:23:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:23:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:23:10,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:23:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:23:11,842][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:23:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:23:13,155][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:23:13,812][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:23:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:23:15,125][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:23:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:23:16,438][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:23:17,095][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:23:18,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:23:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:23:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:23:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:23:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:23:21,369][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:23:22,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:23:22,682][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:23:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:23:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:23:24,655][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:23:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:23:25,970][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:23:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:23:27,284][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:23:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:23:28,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:23:29,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:23:31,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:23:31,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:23:31,070][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:23:32,459][__main__][INFO] - Iteration 610 took 53s (9.53% Gen, 87.85% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 41m 23s. Estimated total time: 14h 43m 32s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 21s, 500 more iterations: 7h 21m 46s. [2026-03-25 23:23:32,461][__main__][INFO] - Starting iteration 610. [2026-03-25 23:23:32,465][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:23:32,465][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:23:37,663][__main__][INFO] - Number of regex retries in iteration 610: 0 [2026-03-25 23:23:37,665][__main__][INFO] - agents played in iteration 610 are Alice, Bob [2026-03-25 23:23:38,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:23:38,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:23:38,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:23:38,288][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:23:39,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:23:39,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:23:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:23:40,977][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:23:41,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:23:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:23:42,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:23:43,602][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:23:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:23:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:23:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:23:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:23:46,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:23:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:23:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:23:48,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:23:49,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:23:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:23:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:23:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:23:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:23:52,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:23:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:23:54,114][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:23:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:23:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:23:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:23:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:23:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:23:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:23:58,712][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:23:59,369][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:24:00,026][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:24:00,682][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:24:01,338][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:24:01,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:24:02,652][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:24:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:24:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:24:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:24:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:24:05,936][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:24:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:24:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:24:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:24:08,563][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:24:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:24:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:24:10,867][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:24:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:24:12,183][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:24:12,838][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:24:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:24:14,152][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:24:14,809][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:24:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:24:16,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:24:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:24:17,443][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:24:18,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:24:18,765][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:24:19,428][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:24:20,091][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:24:20,750][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:24:21,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:24:22,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:24:24,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:24:24,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:24:24,090][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:24:25,540][__main__][INFO] - Iteration 611 took 53s (9.80% Gen, 87.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 41m 34s. Estimated total time: 14h 44m 36s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 27s, 500 more iterations: 7h 22m 18s. [2026-03-25 23:24:25,542][__main__][INFO] - Starting iteration 611. [2026-03-25 23:24:25,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:24:25,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:24:31,595][__main__][INFO] - Number of regex retries in iteration 611: 0 [2026-03-25 23:24:31,597][__main__][INFO] - agents played in iteration 611 are Alice, Bob [2026-03-25 23:24:32,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:24:32,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:24:32,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:24:32,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:24:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:24:33,879][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:24:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:24:35,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:24:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:24:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:24:37,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:24:37,819][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:24:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:24:39,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:24:39,789][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:24:40,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:24:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:24:41,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:24:42,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:24:43,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:24:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:24:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:24:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:24:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:24:46,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:24:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:24:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:24:48,329][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:24:48,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:24:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:24:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:24:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:24:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:24:52,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:24:52,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:24:53,584][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:24:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:24:54,898][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:24:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:24:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:24:56,871][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:24:57,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:24:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:24:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:24:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:25:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:25:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:25:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:25:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:25:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:25:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:25:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:25:05,087][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:25:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:25:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:25:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:25:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:25:08,373][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:25:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:25:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:25:10,388][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:25:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:25:11,700][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:25:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:25:13,014][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:25:13,670][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:25:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:25:14,984][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:25:15,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:25:16,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:25:17,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:25:17,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:25:17,659][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:25:19,055][__main__][INFO] - Iteration 612 took 53s (11.31% Gen, 86.08% Train). Generation: 6s, Training: 46s. Estimated remaining time: 5h 47m 56s. Estimated total time: 14h 51m 51s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 11s, 500 more iterations: 7h 25m 55s. [2026-03-25 23:25:19,062][__main__][INFO] - Starting iteration 612. [2026-03-25 23:25:19,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:25:19,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:25:24,266][__main__][INFO] - Number of regex retries in iteration 612: 0 [2026-03-25 23:25:24,267][__main__][INFO] - agents played in iteration 612 are Alice, Bob [2026-03-25 23:25:24,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:25:24,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:25:24,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:25:24,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:25:25,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:25:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:25:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:25:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:25:28,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:25:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:25:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:25:30,181][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:25:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:25:31,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:25:32,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:25:32,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:25:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:25:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:25:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:25:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:25:36,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:25:36,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:25:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:25:38,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:25:38,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:25:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:25:40,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:25:40,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:25:41,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:25:42,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:25:42,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:25:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:25:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:25:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:25:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:25:45,952][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:25:46,608][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:25:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:25:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:25:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:25:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:25:49,892][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:25:50,549][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:25:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:25:51,863][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:25:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:25:53,175][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:25:53,832][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:25:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:25:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:25:55,802][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:25:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:25:57,448][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:25:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:25:58,762][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:25:59,419][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:26:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:26:00,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:26:01,389][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:26:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:26:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:26:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:26:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:26:04,672][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:26:05,329][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:26:05,986][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:26:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:26:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:26:07,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:26:08,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:26:10,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:26:12,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:26:12,217][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:26:13,551][__main__][INFO] - Iteration 613 took 54s (9.54% Gen, 88.01% Train). Generation: 5s, Training: 47s. Estimated remaining time: 6h 3m 14s. Estimated total time: 15h 8m 3s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 1s. [2026-03-25 23:26:13,555][__main__][INFO] - Starting iteration 613. [2026-03-25 23:26:13,561][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:26:13,561][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:26:18,613][__main__][INFO] - Number of regex retries in iteration 613: 0 [2026-03-25 23:26:18,615][__main__][INFO] - agents played in iteration 613 are Alice, Bob [2026-03-25 23:26:19,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:26:19,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:26:19,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:26:19,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:26:20,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:26:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:26:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:26:22,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:26:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:26:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:26:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:26:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:26:25,433][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:26:26,090][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:26:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:26:27,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:26:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:26:28,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:26:29,375][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:26:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:26:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:26:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:26:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:26:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:26:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:26:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:26:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:26:35,284][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:26:35,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:26:36,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:26:37,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:26:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:26:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:26:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:26:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:26:40,541][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:26:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:26:41,854][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:26:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:26:43,167][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:26:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:26:44,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:26:45,137][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:26:45,793][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:26:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:26:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:26:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:26:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:26:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:26:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:26:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:26:51,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:26:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:26:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:26:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:26:54,009][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:26:54,666][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:26:55,324][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:26:55,981][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:26:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:26:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:26:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:26:58,612][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:26:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:26:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:27:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:27:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:27:01,898][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:27:02,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:27:03,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:27:04,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:27:04,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:27:04,552][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:27:05,902][__main__][INFO] - Iteration 614 took 52s (9.66% Gen, 87.76% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 26m 42s. Estimated total time: 14h 32m 24s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 12s. [2026-03-25 23:27:05,905][__main__][INFO] - Starting iteration 614. [2026-03-25 23:27:05,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:27:05,909][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:27:10,872][__main__][INFO] - Number of regex retries in iteration 614: 0 [2026-03-25 23:27:10,874][__main__][INFO] - agents played in iteration 614 are Alice, Bob [2026-03-25 23:27:11,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:27:11,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:27:11,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:27:11,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:27:12,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:27:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:27:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:27:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:27:14,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:27:15,381][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:27:16,038][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:27:16,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:27:17,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:27:18,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:27:18,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:27:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:27:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:27:20,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:27:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:27:21,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:27:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:27:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:27:23,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:27:24,580][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:27:25,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:27:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:27:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:27:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:27:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:27:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:27:29,178][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:27:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:27:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:27:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:27:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:27:32,461][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:27:33,118][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:27:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:27:34,432][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:27:35,088][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:27:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:27:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:27:37,059][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:27:37,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:27:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:27:39,030][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:27:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:27:40,344][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:27:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:27:41,658][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:27:42,315][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:27:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:27:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:27:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:27:45,279][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:27:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:27:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:27:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:27:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:27:48,564][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:27:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:27:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:27:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:27:51,192][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:27:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:27:52,507][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:27:53,163][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:27:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:27:54,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:27:55,291][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:27:56,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:27:56,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:27:56,727][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:27:58,005][__main__][INFO] - Iteration 615 took 52s (9.53% Gen, 88.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 21m 43s. Estimated total time: 14h 28m 17s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 8s. [2026-03-25 23:27:58,008][__main__][INFO] - Starting iteration 615. [2026-03-25 23:27:58,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:27:58,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:28:02,928][__main__][INFO] - Number of regex retries in iteration 615: 0 [2026-03-25 23:28:02,929][__main__][INFO] - agents played in iteration 615 are Alice, Bob [2026-03-25 23:28:03,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:03,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:03,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:28:03,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:28:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:28:04,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:28:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:28:06,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:28:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:28:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:28:08,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:28:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:28:09,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:28:10,235][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:28:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:28:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:28:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:28:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:28:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:28:14,175][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:28:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:28:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:28:16,146][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:28:16,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:28:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:28:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:28:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:28:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:28:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:28:20,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:28:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:28:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:28:22,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:28:23,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:28:24,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:28:24,683][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:28:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:28:25,998][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:28:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:28:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:28:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:28:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:28:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:28:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:28:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:28:31,254][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:28:31,911][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:28:32,568][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:28:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:28:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:28:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:28:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:28:36,152][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:28:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:28:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:28:38,124][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:28:38,781][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:28:39,438][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:28:40,095][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:28:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:28:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:28:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:28:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:28:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:28:44,046][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:28:44,703][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:28:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:28:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:28:46,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:28:47,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:28:48,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:28:48,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:28:48,752][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:28:50,111][__main__][INFO] - Iteration 616 took 52s (9.40% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 20m 34s. Estimated total time: 14h 28m 0s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 0s. [2026-03-25 23:28:50,115][__main__][INFO] - Starting iteration 616. [2026-03-25 23:28:50,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:28:50,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:28:55,139][__main__][INFO] - Number of regex retries in iteration 616: 0 [2026-03-25 23:28:55,141][__main__][INFO] - agents played in iteration 616 are Alice, Bob [2026-03-25 23:28:55,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:55,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:55,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:28:55,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:28:56,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:28:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:28:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:28:58,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:28:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:28:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:29:00,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:29:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:29:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:29:02,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:29:03,118][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:29:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:29:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:29:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:29:05,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:29:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:29:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:29:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:29:08,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:29:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:29:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:29:10,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:29:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:29:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:29:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:29:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:29:13,628][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:29:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:29:14,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:29:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:29:16,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:29:16,913][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:29:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:29:18,227][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:29:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:29:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:29:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:29:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:29:21,510][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:29:22,168][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:29:22,824][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:29:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:29:24,138][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:29:24,795][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:29:25,453][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:29:26,112][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:29:26,770][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:29:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:29:28,420][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:29:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:29:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:29:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:29:31,049][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:29:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:29:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:29:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:29:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:29:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:29:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:29:35,648][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:29:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:29:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:29:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:29:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:29:38,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:29:39,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:29:40,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:29:40,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:29:40,910][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:29:42,307][__main__][INFO] - Iteration 617 took 52s (9.62% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 21m 32s. Estimated total time: 14h 29m 50s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 59s, 500 more iterations: 7h 14m 55s. [2026-03-25 23:29:42,310][__main__][INFO] - Starting iteration 617. [2026-03-25 23:29:42,315][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:29:42,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:29:47,254][__main__][INFO] - Number of regex retries in iteration 617: 0 [2026-03-25 23:29:47,256][__main__][INFO] - agents played in iteration 617 are Alice, Bob [2026-03-25 23:29:47,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:29:47,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:29:47,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:29:47,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:29:48,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:29:49,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:29:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:29:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:29:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:29:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:29:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:29:53,219][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:29:53,876][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:29:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:29:55,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:29:55,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:29:56,509][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:29:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:29:57,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:29:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:29:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:29:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:30:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:30:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:30:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:30:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:30:03,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:30:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:30:04,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:30:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:30:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:30:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:30:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:30:07,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:30:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:30:08,998][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:30:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:30:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:30:10,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:30:11,626][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:30:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:30:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:30:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:30:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:30:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:30:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:30:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:30:16,882][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:30:17,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:30:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:30:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:30:19,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:30:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:30:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:30:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:30:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:30:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:30:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:30:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:30:25,082][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:30:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:30:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:30:27,057][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:30:27,714][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:30:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:30:29,029][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:30:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:30:30,344][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:30:31,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:30:31,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:30:32,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:30:32,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:30:32,930][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:30:34,264][__main__][INFO] - Iteration 618 took 51s (9.51% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 16m 40s. Estimated total time: 14h 25m 51s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 55s. [2026-03-25 23:30:34,267][__main__][INFO] - Starting iteration 618. [2026-03-25 23:30:34,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:30:34,272][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:30:39,232][__main__][INFO] - Number of regex retries in iteration 618: 0 [2026-03-25 23:30:39,234][__main__][INFO] - agents played in iteration 618 are Alice, Bob [2026-03-25 23:30:39,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:30:39,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:30:39,940][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:30:39,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:30:40,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:30:41,246][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:30:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:30:42,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:30:43,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:30:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:30:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:30:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:30:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:30:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:30:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:30:47,810][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:30:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:30:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:30:49,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:30:50,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:30:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:30:51,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:30:52,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:30:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:30:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:30:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:30:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:30:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:30:56,345][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:30:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:30:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:30:58,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:30:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:30:59,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:31:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:31:00,942][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:31:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:31:02,256][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:31:02,912][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:31:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:31:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:31:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:31:05,539][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:31:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:31:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:31:07,508][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:31:08,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:31:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:31:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:31:10,135][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:31:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:31:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:31:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:31:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:31:13,671][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:31:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:31:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:31:15,641][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:31:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:31:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:31:17,612][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:31:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:31:18,925][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:31:19,582][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:31:20,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:31:20,896][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:31:21,552][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:31:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:31:22,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:31:23,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-25 23:31:24,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:31:24,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:31:24,877][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:31:26,363][__main__][INFO] - Iteration 619 took 52s (9.53% Gen, 87.62% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 18m 11s. Estimated total time: 14h 28m 14s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 7s. [2026-03-25 23:31:26,367][__main__][INFO] - Starting iteration 619. [2026-03-25 23:31:26,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:31:26,371][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:31:32,598][__main__][INFO] - Number of regex retries in iteration 619: 0 [2026-03-25 23:31:32,599][__main__][INFO] - agents played in iteration 619 are Alice, Bob [2026-03-25 23:31:33,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:31:33,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:31:33,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:31:33,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:31:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:31:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:31:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:31:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:31:36,444][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:31:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:31:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:31:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:31:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:31:39,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:31:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:31:41,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:31:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:31:42,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:31:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:31:43,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:31:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:31:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:31:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:31:46,293][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:31:46,950][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:31:47,607][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:31:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:31:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:31:49,578][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:31:50,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:31:50,892][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:31:51,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:31:52,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:31:52,862][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:31:53,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:31:54,176][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:31:54,833][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:31:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:31:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:31:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:31:57,461][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:31:58,118][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:31:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:31:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:32:00,087][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:32:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:32:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:32:02,056][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:32:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:32:03,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:32:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:32:04,683][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:32:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:32:06,319][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:32:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:32:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:32:08,289][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:32:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:32:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:32:10,259][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:32:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:32:11,573][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:32:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:32:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:32:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:32:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:32:14,858][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:32:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:32:16,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:32:16,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:32:18,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:32:18,064][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:32:18,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:32:19,349][__main__][INFO] - Iteration 620 took 52s (11.75% Gen, 85.82% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 32m 4s. Estimated total time: 14h 42m 59s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 29s. [2026-03-25 23:32:19,351][__main__][INFO] - Starting iteration 620. [2026-03-25 23:32:19,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:32:19,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:32:24,770][__main__][INFO] - Number of regex retries in iteration 620: 0 [2026-03-25 23:32:24,772][__main__][INFO] - agents played in iteration 620 are Alice, Bob [2026-03-25 23:32:25,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:32:25,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:32:25,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:32:25,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:32:26,164][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:32:26,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:32:27,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:32:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:32:28,746][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:32:29,402][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:32:30,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:32:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:32:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:32:32,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:32:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:32:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:32:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:32:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:32:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:32:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:32:36,627][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:32:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:32:37,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:32:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:32:39,254][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:32:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:32:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:32:41,224][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:32:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:32:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:32:43,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:32:43,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:32:44,507][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:32:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:32:45,821][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:32:46,477][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:32:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:32:47,792][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:32:48,449][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:32:49,106][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:32:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:32:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:32:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:32:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:32:52,389][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:32:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:32:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:32:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:32:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:32:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:32:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:32:56,989][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:32:57,946][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:32:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:32:59,268][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:32:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:33:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:33:01,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:33:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:33:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:33:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:33:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:33:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:33:05,232][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:33:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:33:06,570][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:33:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:33:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:33:08,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:33:10,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:33:11,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:33:11,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:33:11,408][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:33:12,695][__main__][INFO] - Iteration 621 took 53s (10.15% Gen, 87.43% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 37m 13s. Estimated total time: 14h 49m 1s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 54s, 500 more iterations: 7h 24m 30s. [2026-03-25 23:33:12,698][__main__][INFO] - Starting iteration 621. [2026-03-25 23:33:12,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:33:12,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:33:18,144][__main__][INFO] - Number of regex retries in iteration 621: 0 [2026-03-25 23:33:18,146][__main__][INFO] - agents played in iteration 621 are Alice, Bob [2026-03-25 23:33:18,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:33:18,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:33:18,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:33:18,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:33:19,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:33:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:33:20,926][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:33:21,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:33:22,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:33:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:33:23,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:33:24,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:33:24,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:33:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:33:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:33:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:33:27,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:33:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:33:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:33:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:33:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:33:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:33:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:33:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:33:32,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:33:33,517][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:33:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:33:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:33:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:33:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:33:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:33:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:33:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:33:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:33:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:33:40,167][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:33:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:33:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:33:42,163][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:33:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:33:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:33:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:33:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:33:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:33:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:33:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:33:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:33:48,141][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:33:48,802][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:33:49,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:33:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:33:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:33:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:33:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:33:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:33:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:33:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:33:55,223][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:33:55,883][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:33:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:33:57,202][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:33:57,861][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:33:58,518][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:33:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:33:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:34:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:34:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:34:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:34:02,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:34:03,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:34:04,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:34:04,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:34:04,770][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:34:06,348][__main__][INFO] - Iteration 622 took 53s (10.14% Gen, 86.97% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 41m 24s. Estimated total time: 14h 54m 6s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 24s, 500 more iterations: 7h 27m 3s. [2026-03-25 23:34:06,350][__main__][INFO] - Starting iteration 622. [2026-03-25 23:34:06,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:34:06,355][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:34:11,358][__main__][INFO] - Number of regex retries in iteration 622: 0 [2026-03-25 23:34:11,360][__main__][INFO] - agents played in iteration 622 are Alice, Bob [2026-03-25 23:34:12,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:34:12,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:34:12,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:34:12,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:34:12,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:34:13,559][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:34:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:34:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:34:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:34:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:34:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:34:17,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:34:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:34:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:34:19,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:34:20,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:34:20,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:34:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:34:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:34:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:34:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:34:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:34:24,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:34:25,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:34:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:34:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:34:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:34:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:34:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:34:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:34:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:34:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:34:31,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:34:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:34:32,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:34:33,295][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:34:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:34:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:34:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:34:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:34:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:34:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:34:37,897][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:34:38,554][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:34:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:34:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:34:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:34:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:34:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:34:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:34:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:34:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:34:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:34:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:34:46,115][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:34:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:34:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:34:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:34:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:34:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:34:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:34:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:34:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:34:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:34:52,691][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:34:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:34:54,007][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:34:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:34:55,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:34:56,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:34:57,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:34:57,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:34:57,703][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:34:59,123][__main__][INFO] - Iteration 623 took 52s (9.48% Gen, 87.82% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 25m 55s. Estimated total time: 14h 39m 30s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 57s, 500 more iterations: 7h 19m 45s. [2026-03-25 23:34:59,126][__main__][INFO] - Starting iteration 623. [2026-03-25 23:34:59,130][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:34:59,131][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:35:05,939][__main__][INFO] - Number of regex retries in iteration 623: 0 [2026-03-25 23:35:05,940][__main__][INFO] - agents played in iteration 623 are Alice, Bob [2026-03-25 23:35:06,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:06,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:06,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:35:06,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:35:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:35:08,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:35:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:35:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:35:10,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:35:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:35:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:35:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:35:12,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:35:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:35:14,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:35:14,739][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:35:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:35:16,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:35:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:35:17,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:35:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:35:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:35:19,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:35:20,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:35:20,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:35:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:35:21,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:35:22,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:35:23,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:35:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:35:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:35:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:35:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:35:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:35:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:35:27,903][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:35:28,564][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:35:29,224][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:35:29,882][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:35:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:35:31,199][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:35:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:35:32,516][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:35:33,174][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:35:33,832][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:35:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:35:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:35:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:35:36,471][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:35:37,130][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:35:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:35:38,451][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:35:39,451][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:35:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:35:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:35:41,428][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:35:42,087][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:35:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:35:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:35:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:35:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:35:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:35:46,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:35:46,702][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:35:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:35:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:35:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:35:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:35:49,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:35:50,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:35:52,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:35:52,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:35:52,058][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:35:53,294][__main__][INFO] - Iteration 624 took 54s (12.57% Gen, 85.14% Train). Generation: 6s, Training: 46s. Estimated remaining time: 5h 48m 17s. Estimated total time: 15h 2m 46s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 16s, 500 more iterations: 7h 31m 23s. [2026-03-25 23:35:53,297][__main__][INFO] - Starting iteration 624. [2026-03-25 23:35:53,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:35:53,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:35:58,925][__main__][INFO] - Number of regex retries in iteration 624: 0 [2026-03-25 23:35:58,927][__main__][INFO] - agents played in iteration 624 are Alice, Bob [2026-03-25 23:35:59,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:59,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:59,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:35:59,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:36:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:36:00,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:36:01,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:36:02,148][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:36:02,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:36:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:36:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:36:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:36:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:36:06,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:36:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:36:07,417][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:36:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:36:08,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:36:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:36:10,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:36:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:36:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:36:12,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:36:12,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:36:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:36:14,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:36:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:36:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:36:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:36:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:36:17,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:36:17,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:36:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:36:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:36:19,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:36:20,606][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:36:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:36:21,924][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:36:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:36:23,241][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:36:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:36:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:36:25,221][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:36:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:36:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:36:27,201][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:36:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:36:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:36:29,177][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:36:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:36:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:36:31,153][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:36:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:36:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:36:33,476][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:36:34,135][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:36:34,794][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:36:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:36:36,111][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:36:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:36:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:36:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:36:38,747][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:36:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:36:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:36:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:36:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:36:42,041][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:36:42,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:36:43,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:36:44,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:36:44,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:36:44,695][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:36:46,287][__main__][INFO] - Iteration 625 took 52s (10.61% Gen, 86.37% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 27m 45s. Estimated total time: 14h 43m 7s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 18s, 500 more iterations: 7h 21m 33s. [2026-03-25 23:36:46,290][__main__][INFO] - Starting iteration 625. [2026-03-25 23:36:46,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:36:46,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:36:51,269][__main__][INFO] - Number of regex retries in iteration 625: 0 [2026-03-25 23:36:51,270][__main__][INFO] - agents played in iteration 625 are Alice, Bob [2026-03-25 23:36:51,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:36:51,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:36:51,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:36:51,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:36:52,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:36:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:36:53,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:36:54,580][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:36:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:36:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:36:56,554][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:36:57,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:36:57,868][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:36:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:36:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:36:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:37:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:37:01,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:37:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:37:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:37:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:37:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:37:04,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:37:05,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:37:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:37:06,412][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:37:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:37:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:37:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:37:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:37:09,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:37:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:37:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:37:11,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:37:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:37:12,988][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:37:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:37:14,303][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:37:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:37:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:37:16,276][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:37:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:37:17,593][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:37:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:37:18,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:37:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:37:20,223][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:37:20,880][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:37:21,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:37:22,196][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:37:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:37:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:37:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:37:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:37:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:37:26,428][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:37:27,086][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:37:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:37:28,401][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:37:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:37:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:37:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:37:31,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:37:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:37:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:37:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:37:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:37:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:37:34,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:37:35,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:37:36,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:37:36,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:37:36,902][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:37:38,576][__main__][INFO] - Iteration 626 took 52s (9.52% Gen, 87.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 15m 8s. Estimated total time: 14h 31m 23s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 41s. [2026-03-25 23:37:38,580][__main__][INFO] - Starting iteration 626. [2026-03-25 23:37:38,593][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:37:38,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:37:43,745][__main__][INFO] - Number of regex retries in iteration 626: 0 [2026-03-25 23:37:43,746][__main__][INFO] - agents played in iteration 626 are Alice, Bob [2026-03-25 23:37:44,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:37:44,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:37:44,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:37:44,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:37:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:37:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:37:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:37:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:37:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:37:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:37:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:37:49,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:37:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:37:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:37:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:37:52,440][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:37:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:37:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:37:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:37:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:37:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:37:56,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:37:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:37:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:37:58,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:37:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:37:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:38:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:38:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:38:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:38:02,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:38:03,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:38:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:38:04,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:38:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:38:05,639][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:38:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:38:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:38:07,618][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:38:08,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:38:08,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:38:09,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:38:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:38:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:38:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:38:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:38:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:38:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:38:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:38:14,875][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:38:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:38:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:38:17,204][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:38:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:38:18,524][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:38:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:38:19,845][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:38:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:38:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:38:21,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:38:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:38:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:38:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:38:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:38:25,119][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:38:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:38:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:38:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:38:27,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:38:28,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:38:29,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:38:29,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:38:29,759][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:38:31,097][__main__][INFO] - Iteration 627 took 52s (9.81% Gen, 87.63% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 17m 59s. Estimated total time: 14h 35m 6s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 33s. [2026-03-25 23:38:31,099][__main__][INFO] - Starting iteration 627. [2026-03-25 23:38:31,103][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:38:31,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:38:36,122][__main__][INFO] - Number of regex retries in iteration 627: 0 [2026-03-25 23:38:36,123][__main__][INFO] - agents played in iteration 627 are Alice, Bob [2026-03-25 23:38:36,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:38:36,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:38:36,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:38:36,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:38:37,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:38:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:38:38,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:38:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:38:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:38:40,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:38:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:38:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:38:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:38:43,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:38:44,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:38:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:38:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:38:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:38:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:38:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:38:47,985][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:38:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:38:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:38:49,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:38:50,636][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:38:51,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:38:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:38:52,612][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:38:53,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:38:53,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:38:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:38:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:38:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:38:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:38:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:38:57,890][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:38:58,551][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:38:59,210][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:38:59,868][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:39:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:39:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:39:01,845][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:39:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:39:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:39:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:39:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:39:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:39:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:39:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:39:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:39:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:39:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:39:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:39:10,072][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:39:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:39:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:39:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:39:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:39:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:39:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:39:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:39:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:39:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:39:16,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:39:17,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:39:17,980][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:39:18,639][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:39:19,298][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:39:19,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:39:20,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:39:21,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:39:21,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:39:21,843][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:39:23,187][__main__][INFO] - Iteration 628 took 52s (9.64% Gen, 87.78% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 10m 6s. Estimated total time: 14h 28m 6s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 3s. [2026-03-25 23:39:23,193][__main__][INFO] - Starting iteration 628. [2026-03-25 23:39:23,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:39:23,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:39:28,603][__main__][INFO] - Number of regex retries in iteration 628: 0 [2026-03-25 23:39:28,607][__main__][INFO] - agents played in iteration 628 are Alice, Bob [2026-03-25 23:39:29,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:29,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:29,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:39:29,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:39:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:39:30,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:39:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:39:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:39:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:39:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:39:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:39:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:39:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:39:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:39:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:39:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:39:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:39:38,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:39:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:39:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:39:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:39:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:39:41,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:39:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:39:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:39:43,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:39:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:39:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:39:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:39:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:39:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:39:47,814][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:39:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:39:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:39:49,789][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:39:50,448][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:39:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:39:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:39:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:39:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:39:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:39:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:39:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:39:55,717][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:39:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:39:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:39:57,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:39:58,356][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:39:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:39:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:40:00,333][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:40:00,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:40:01,992][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:40:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:40:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:40:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:40:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:40:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:40:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:40:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:40:07,267][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:40:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:40:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:40:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:40:09,903][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:40:10,561][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:40:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:40:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:40:12,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:40:13,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:40:14,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:40:14,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:40:14,501][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:40:15,828][__main__][INFO] - Iteration 629 took 52s (10.28% Gen, 87.20% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 18m 21s. Estimated total time: 14h 37m 13s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 36s. [2026-03-25 23:40:15,831][__main__][INFO] - Starting iteration 629. [2026-03-25 23:40:15,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:40:15,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:40:20,933][__main__][INFO] - Number of regex retries in iteration 629: 0 [2026-03-25 23:40:20,934][__main__][INFO] - agents played in iteration 629 are Alice, Bob [2026-03-25 23:40:21,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:40:21,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:40:21,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:40:21,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:40:22,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:40:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:40:23,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:40:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:40:24,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:40:25,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:40:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:40:26,850][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:40:27,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:40:28,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:40:28,822][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:40:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:40:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:40:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:40:31,451][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:40:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:40:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:40:33,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:40:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:40:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:40:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:40:36,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:40:36,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:40:37,370][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:40:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:40:38,685][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:40:39,343][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:40:40,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:40:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:40:41,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:40:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:40:42,632][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:40:43,290][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:40:43,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:40:44,607][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:40:45,264][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:40:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:40:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:40:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:40:47,896][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:40:48,554][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:40:49,212][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:40:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:40:50,529][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:40:51,187][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:40:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:40:52,504][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:40:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:40:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:40:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:40:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:40:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:40:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:40:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:40:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:40:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:40:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:41:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:41:00,694][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:41:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:41:02,010][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:41:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:41:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:41:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:41:04,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:41:05,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:41:06,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:41:06,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:41:06,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:41:07,870][__main__][INFO] - Iteration 630 took 52s (9.80% Gen, 87.78% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 7m 32s. Estimated total time: 14h 27m 16s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 38s. [2026-03-25 23:41:07,872][__main__][INFO] - Starting iteration 630. [2026-03-25 23:41:07,877][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:41:07,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:41:12,727][__main__][INFO] - Number of regex retries in iteration 630: 0 [2026-03-25 23:41:12,728][__main__][INFO] - agents played in iteration 630 are Alice, Bob [2026-03-25 23:41:13,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:41:13,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:41:13,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:41:13,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:41:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:41:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:41:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:41:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:41:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:41:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:41:17,924][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:41:18,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:41:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:41:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:41:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:41:21,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:41:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:41:22,524][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:41:23,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:41:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:41:24,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:41:25,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:41:25,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:41:26,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:41:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:41:27,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:41:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:41:29,101][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:41:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:41:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:41:31,075][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:41:31,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:41:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:41:33,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:41:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:41:34,363][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:41:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:41:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:41:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:41:36,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:41:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:41:38,312][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:41:38,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:41:39,628][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:41:40,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:41:40,944][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:41:41,602][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:41:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:41:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:41:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:41:44,233][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:41:44,891][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:41:45,788][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:41:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:41:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:41:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:41:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:41:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:41:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:41:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:41:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:41:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:41:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:41:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:41:53,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:41:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:41:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:41:55,657][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:41:56,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:41:57,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:41:58,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:41:58,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:41:58,232][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:41:59,571][__main__][INFO] - Iteration 631 took 51s (9.38% Gen, 88.02% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 1m 1s. Estimated total time: 14h 21m 36s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 9s, 500 more iterations: 7h 10m 48s. [2026-03-25 23:41:59,574][__main__][INFO] - Starting iteration 631. [2026-03-25 23:41:59,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:41:59,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:42:04,365][__main__][INFO] - Number of regex retries in iteration 631: 0 [2026-03-25 23:42:04,367][__main__][INFO] - agents played in iteration 631 are Alice, Bob [2026-03-25 23:42:05,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:05,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:05,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:42:05,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:42:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:42:06,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:42:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:42:07,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:42:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:42:09,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:42:09,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:42:10,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:42:10,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:42:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:42:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:42:12,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:42:13,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:42:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:42:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:42:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:42:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:42:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:42:17,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:42:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:42:18,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:42:19,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:42:20,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:42:20,833][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:42:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:42:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:42:22,806][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:42:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:42:24,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:42:24,779][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:42:25,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:42:26,095][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:42:26,754][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:42:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:42:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:42:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:42:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:42:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:42:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:42:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:42:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:42:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:42:33,334][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:42:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:42:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:42:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:42:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:42:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:42:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:42:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:42:38,826][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:42:39,484][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:42:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:42:40,799][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:42:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:42:42,115][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:42:42,773][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:42:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:42:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:42:44,747][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:42:45,404][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:42:46,062][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:42:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:42:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:42:48,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:42:48,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:42:50,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:42:50,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:42:50,029][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:42:51,351][__main__][INFO] - Iteration 632 took 51s (9.25% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 1m 27s. Estimated total time: 14h 22m 55s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 17s, 500 more iterations: 7h 11m 27s. [2026-03-25 23:42:51,354][__main__][INFO] - Starting iteration 632. [2026-03-25 23:42:51,358][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:42:51,358][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:42:56,356][__main__][INFO] - Number of regex retries in iteration 632: 0 [2026-03-25 23:42:56,358][__main__][INFO] - agents played in iteration 632 are Alice, Bob [2026-03-25 23:42:56,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:57,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:57,042][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:42:57,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:42:57,689][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:42:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:42:58,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:42:59,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:43:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:43:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:43:01,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:43:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:43:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:43:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:43:04,228][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:43:04,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:43:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:43:06,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:43:06,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:43:07,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:43:08,197][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:43:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:43:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:43:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:43:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:43:11,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:43:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:43:12,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:43:13,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:43:14,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:43:14,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:43:15,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:43:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:43:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:43:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:43:18,097][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:43:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:43:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:43:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:43:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:43:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:43:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:43:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:43:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:43:24,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:43:24,693][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:43:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:43:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:43:26,670][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:43:27,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:43:27,988][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:43:28,646][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:43:29,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:43:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:43:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:43:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:43:32,262][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:43:32,920][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:43:33,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:43:34,237][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:43:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:43:35,554][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:43:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:43:36,871][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:43:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:43:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:43:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:43:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:43:40,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:43:40,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:43:41,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:43:41,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:43:41,989][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:43:43,413][__main__][INFO] - Iteration 633 took 52s (9.60% Gen, 87.66% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 5m 17s. Estimated total time: 14h 27m 36s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 48s. [2026-03-25 23:43:43,416][__main__][INFO] - Starting iteration 633. [2026-03-25 23:43:43,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:43:43,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:43:48,467][__main__][INFO] - Number of regex retries in iteration 633: 0 [2026-03-25 23:43:48,468][__main__][INFO] - agents played in iteration 633 are Alice, Bob [2026-03-25 23:43:49,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:43:49,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:43:49,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:43:49,211][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:43:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:43:50,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:43:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:43:51,810][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:43:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:43:53,130][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:43:53,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:43:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:43:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:43:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:43:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:43:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:43:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:43:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:43:59,073][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:43:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:44:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:44:01,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:44:01,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:44:02,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:44:03,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:44:03,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:44:04,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:44:05,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:44:05,676][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:44:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:44:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:44:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:44:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:44:08,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:44:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:44:10,297][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:44:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:44:11,618][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:44:12,278][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:44:12,938][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:44:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:44:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:44:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:44:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:44:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:44:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:44:17,561][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:44:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:44:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:44:19,541][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:44:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:44:20,862][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:44:21,847][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:44:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:44:23,168][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:44:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:44:24,487][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:44:25,146][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:44:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:44:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:44:27,126][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:44:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:44:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:44:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:44:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:44:30,423][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:44:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:44:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:44:32,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:44:33,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:44:34,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:44:34,280][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:44:34,281][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:44:35,800][__main__][INFO] - Iteration 634 took 52s (9.64% Gen, 87.46% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 9m 49s. Estimated total time: 14h 33m 1s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 30s. [2026-03-25 23:44:35,802][__main__][INFO] - Starting iteration 634. [2026-03-25 23:44:35,806][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:44:35,807][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:44:40,562][__main__][INFO] - Number of regex retries in iteration 634: 0 [2026-03-25 23:44:40,563][__main__][INFO] - agents played in iteration 634 are Alice, Bob [2026-03-25 23:44:41,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:44:41,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:44:41,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:44:41,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:44:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:44:42,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:44:43,334][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:44:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:44:44,650][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:44:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:44:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:44:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:44:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:44:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:44:48,603][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:44:49,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:44:49,926][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:44:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:44:51,246][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:44:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:44:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:44:53,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:44:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:44:54,540][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:44:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:44:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:44:56,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:44:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:44:57,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:44:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:44:59,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:44:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:45:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:45:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:45:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:45:02,458][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:45:03,118][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:45:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:45:04,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:45:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:45:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:45:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:45:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:45:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:45:08,392][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:45:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:45:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:45:10,369][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:45:11,028][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:45:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:45:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:45:13,006][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:45:14,010][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:45:14,671][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:45:15,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:45:15,990][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:45:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:45:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:45:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:45:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:45:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:45:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:45:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:45:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:45:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:45:22,586][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:45:23,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:45:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:45:24,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:45:25,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:45:26,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:45:26,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:45:26,489][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:45:27,931][__main__][INFO] - Iteration 635 took 52s (9.13% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 4m 42s. Estimated total time: 14h 28m 46s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 23s. [2026-03-25 23:45:27,934][__main__][INFO] - Starting iteration 635. [2026-03-25 23:45:27,938][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:45:27,939][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:45:32,950][__main__][INFO] - Number of regex retries in iteration 635: 0 [2026-03-25 23:45:32,952][__main__][INFO] - agents played in iteration 635 are Alice, Bob [2026-03-25 23:45:33,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:45:33,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:45:33,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:45:33,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:45:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:45:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:45:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:45:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:45:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:45:37,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:45:38,208][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:45:38,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:45:39,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:45:40,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:45:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:45:41,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:45:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:45:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:45:43,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:45:44,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:45:44,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:45:45,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:45:46,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:45:46,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:45:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:45:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:45:48,741][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:45:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:45:50,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:45:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:45:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:45:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:45:52,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:45:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:45:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:45:54,681][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:45:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:45:56,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:45:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:45:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:45:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:45:58,639][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:45:59,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:45:59,958][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:46:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:46:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:46:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:46:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:46:03,261][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:46:03,921][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:46:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:46:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:46:06,240][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:46:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:46:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:46:08,218][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:46:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:46:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:46:10,195][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:46:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:46:11,513][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:46:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:46:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:46:13,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:46:14,150][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:46:14,813][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:46:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:46:16,133][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:46:16,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:46:17,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:46:18,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:46:18,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:46:18,822][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:46:20,120][__main__][INFO] - Iteration 636 took 52s (9.61% Gen, 87.90% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 4m 47s. Estimated total time: 14h 29m 43s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 51s. [2026-03-25 23:46:20,123][__main__][INFO] - Starting iteration 636. [2026-03-25 23:46:20,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:46:20,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:46:25,094][__main__][INFO] - Number of regex retries in iteration 636: 0 [2026-03-25 23:46:25,095][__main__][INFO] - agents played in iteration 636 are Alice, Bob [2026-03-25 23:46:25,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:46:25,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:46:25,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:46:25,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:46:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:46:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:46:27,695][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:46:28,352][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:46:29,009][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:46:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:46:30,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:46:30,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:46:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:46:32,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:46:32,952][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:46:33,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:46:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:46:34,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:46:35,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:46:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:46:36,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:46:37,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:46:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:46:38,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:46:39,532][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:46:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:46:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:46:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:46:42,167][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:46:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:46:43,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:46:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:46:44,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:46:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:46:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:46:46,775][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:46:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:46:48,090][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:46:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:46:49,406][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:46:50,064][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:46:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:46:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:46:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:46:52,697][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:46:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:46:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:46:54,670][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:46:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:46:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:46:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:46:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:46:58,410][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:46:59,071][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:46:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:47:00,390][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:47:01,050][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:47:01,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:47:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:47:03,026][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:47:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:47:04,344][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:47:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:47:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:47:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:47:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:47:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:47:08,298][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:47:08,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:47:09,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:47:10,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:47:10,832][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:47:10,833][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:47:12,302][__main__][INFO] - Iteration 637 took 52s (9.52% Gen, 87.66% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 3m 49s. Estimated total time: 14h 29m 37s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 48s. [2026-03-25 23:47:12,305][__main__][INFO] - Starting iteration 637. [2026-03-25 23:47:12,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:47:12,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:47:17,412][__main__][INFO] - Number of regex retries in iteration 637: 0 [2026-03-25 23:47:17,413][__main__][INFO] - agents played in iteration 637 are Alice, Bob [2026-03-25 23:47:18,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:47:18,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:47:18,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:47:18,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:47:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:47:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:47:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:47:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:47:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:47:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:47:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:47:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:47:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:47:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:47:25,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:47:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:47:26,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:47:27,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:47:28,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:47:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:47:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:47:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:47:30,655][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:47:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:47:31,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:47:32,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:47:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:47:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:47:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:47:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:47:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:47:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:47:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:47:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:47:38,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:47:39,223][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:47:39,882][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:47:40,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:47:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:47:41,858][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:47:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:47:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:47:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:47:44,496][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:47:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:47:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:47:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:47:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:47:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:47:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:47:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:47:49,763][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:47:50,753][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:47:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:47:52,069][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:47:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:47:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:47:54,041][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:47:54,700][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:47:55,358][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:47:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:47:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:47:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:47:57,995][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:47:58,656][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:47:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:47:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:48:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:48:01,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:48:02,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:48:03,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:48:03,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:48:03,117][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:48:04,636][__main__][INFO] - Iteration 638 took 52s (9.75% Gen, 87.34% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 5m 28s. Estimated total time: 14h 32m 8s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 4s. [2026-03-25 23:48:04,638][__main__][INFO] - Starting iteration 638. [2026-03-25 23:48:04,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:48:04,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:48:09,546][__main__][INFO] - Number of regex retries in iteration 638: 0 [2026-03-25 23:48:09,548][__main__][INFO] - agents played in iteration 638 are Alice, Bob [2026-03-25 23:48:10,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:48:10,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:48:10,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:48:10,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:48:10,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:48:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:48:12,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:48:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:48:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:48:14,207][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:48:14,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:48:15,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:48:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:48:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:48:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:48:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:48:18,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:48:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:48:20,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:48:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:48:21,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:48:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:48:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:48:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:48:24,085][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:48:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:48:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:48:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:48:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:48:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:48:28,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:48:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:48:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:48:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:48:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:48:31,335][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:48:31,993][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:48:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:48:33,311][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:48:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:48:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:48:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:48:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:48:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:48:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:48:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:48:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:48:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:48:39,906][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:48:40,566][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:48:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:48:41,884][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:48:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:48:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:48:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:48:44,863][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:48:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:48:46,183][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:48:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:48:47,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:48:48,161][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:48:48,820][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:48:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:48:50,138][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:48:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:48:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:48:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:48:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:48:53,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:48:54,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:48:55,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:48:55,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:48:55,339][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:48:56,768][__main__][INFO] - Iteration 639 took 52s (9.41% Gen, 87.84% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 1m 16s. Estimated total time: 14h 28m 48s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 24s. [2026-03-25 23:48:56,771][__main__][INFO] - Starting iteration 639. [2026-03-25 23:48:56,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:48:56,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:49:02,071][__main__][INFO] - Number of regex retries in iteration 639: 0 [2026-03-25 23:49:02,072][__main__][INFO] - agents played in iteration 639 are Alice, Bob [2026-03-25 23:49:02,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:02,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:02,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:49:02,653][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:49:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:49:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:49:04,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:49:05,244][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:49:05,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:49:06,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:49:07,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:49:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:49:08,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:49:09,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:49:09,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:49:10,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:49:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:49:11,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:49:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:49:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:49:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:49:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:49:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:49:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:49:16,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:49:17,094][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:49:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:49:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:49:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:49:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:49:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:49:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:49:21,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:49:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:49:23,021][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:49:23,680][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:49:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:49:24,997][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:49:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:49:26,317][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:49:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:49:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:49:28,294][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:49:28,953][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:49:29,611][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:49:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:49:30,929][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:49:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:49:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:49:32,907][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:49:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:49:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:49:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:49:35,794][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:49:36,455][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:49:37,115][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:49:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:49:38,435][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:49:39,094][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:49:39,753][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:49:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:49:41,072][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:49:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:49:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:49:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:49:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:49:44,370][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:49:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:49:45,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:49:46,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:49:47,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:49:47,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:49:47,540][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:49:48,893][__main__][INFO] - Iteration 640 took 52s (10.16% Gen, 87.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 0m 14s. Estimated total time: 14h 28m 39s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 19s. [2026-03-25 23:49:48,895][__main__][INFO] - Starting iteration 640. [2026-03-25 23:49:48,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:49:48,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:49:53,723][__main__][INFO] - Number of regex retries in iteration 640: 0 [2026-03-25 23:49:53,724][__main__][INFO] - agents played in iteration 640 are Alice, Bob [2026-03-25 23:49:54,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:54,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:54,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:49:54,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:49:55,104][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:49:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:49:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:49:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:49:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:49:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:49:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:49:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:50:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:50:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:50:01,656][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:50:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:50:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:50:03,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:50:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:50:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:50:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:50:06,283][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:50:06,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:50:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:50:08,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:50:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:50:09,581][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:50:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:50:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:50:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:50:12,219][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:50:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:50:13,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:50:14,197][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:50:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:50:15,516][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:50:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:50:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:50:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:50:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:50:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:50:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:50:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:50:20,795][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:50:21,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:50:22,115][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:50:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:50:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:50:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:50:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:50:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:50:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:50:27,076][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:50:27,738][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:50:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:50:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:50:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:50:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:50:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:50:31,697][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:50:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:50:33,016][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:50:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:50:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:50:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:50:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:50:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:50:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:50:37,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:50:38,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:50:39,503][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:50:39,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:50:39,507][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:50:40,880][__main__][INFO] - Iteration 641 took 51s (9.28% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 57m 6s. Estimated total time: 14h 26m 23s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 11s. [2026-03-25 23:50:40,882][__main__][INFO] - Starting iteration 641. [2026-03-25 23:50:40,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:50:40,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:50:45,566][__main__][INFO] - Number of regex retries in iteration 641: 0 [2026-03-25 23:50:45,567][__main__][INFO] - agents played in iteration 641 are Alice, Bob [2026-03-25 23:50:46,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:50:46,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:50:46,161][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:50:46,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:50:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:50:47,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:50:48,105][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:50:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:50:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:50:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:50:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:50:51,400][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:50:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:50:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:50:53,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:50:54,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:50:54,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:50:55,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:50:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:50:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:50:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:50:57,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:50:58,655][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:50:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:50:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:51:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:51:01,294][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:51:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:51:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:51:03,274][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:51:03,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:51:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:51:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:51:05,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:51:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:51:07,232][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:51:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:51:08,551][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:51:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:51:09,869][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:51:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:51:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:51:11,847][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:51:12,506][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:51:13,165][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:51:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:51:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:51:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:51:15,805][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:51:16,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:51:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:51:17,782][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:51:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:51:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:51:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:51:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:51:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:51:22,040][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:51:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:51:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:51:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:51:24,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:51:25,335][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:51:25,995][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:51:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:51:27,316][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:51:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:51:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:51:29,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:51:30,023][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:51:31,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:51:31,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:51:31,198][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:51:32,500][__main__][INFO] - Iteration 642 took 51s (9.07% Gen, 88.40% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 50m 7s. Estimated total time: 14h 20m 15s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 1s, 500 more iterations: 7h 10m 7s. [2026-03-25 23:51:32,503][__main__][INFO] - Starting iteration 642. [2026-03-25 23:51:32,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:51:32,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:51:39,521][__main__][INFO] - Number of regex retries in iteration 642: 0 [2026-03-25 23:51:39,523][__main__][INFO] - agents played in iteration 642 are Alice, Bob [2026-03-25 23:51:40,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:51:40,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:51:40,197][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:51:40,198][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:51:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:51:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:51:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:51:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:51:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:51:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:51:44,770][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:51:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:51:46,084][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:51:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:51:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:51:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:51:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:51:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:51:50,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:51:50,684][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:51:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:51:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:51:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:51:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:51:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:51:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:51:55,286][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:51:55,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:51:56,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:51:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:51:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:51:58,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:51:59,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:51:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:52:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:52:01,206][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:52:01,864][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:52:02,522][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:52:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:52:03,837][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:52:04,495][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:52:05,153][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:52:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:52:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:52:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:52:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:52:08,442][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:52:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:52:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:52:10,415][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:52:11,074][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:52:11,731][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:52:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:52:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:52:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:52:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:52:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:52:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:52:16,574][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:52:17,232][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:52:17,891][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:52:18,548][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:52:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:52:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:52:20,521][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:52:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:52:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:52:22,494][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:52:23,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:52:23,872][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:52:25,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:52:25,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:52:25,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:52:26,314][__main__][INFO] - Iteration 643 took 53s (13.04% Gen, 84.65% Train). Generation: 7s, Training: 45s. Estimated remaining time: 5h 25m 47s. Estimated total time: 14h 56m 49s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 40s, 500 more iterations: 7h 28m 24s. [2026-03-25 23:52:26,317][__main__][INFO] - Starting iteration 643. [2026-03-25 23:52:26,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:52:26,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:52:31,118][__main__][INFO] - Number of regex retries in iteration 643: 0 [2026-03-25 23:52:31,119][__main__][INFO] - agents played in iteration 643 are Alice, Bob [2026-03-25 23:52:31,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:52:31,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:52:31,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:52:31,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:52:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:52:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:52:33,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:52:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:52:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:52:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:52:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:52:37,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:52:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:52:38,427][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:52:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:52:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:52:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:52:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:52:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:52:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:52:43,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:52:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:52:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:52:45,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:52:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:52:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:52:46,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:52:47,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:52:48,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:52:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:52:49,602][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:52:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:52:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:52:51,575][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:52:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:52:52,890][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:52:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:52:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:52:54,863][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:52:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:52:56,179][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:52:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:52:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:52:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:52:58,809][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:52:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:53:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:53:00,782][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:53:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:53:02,097][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:53:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:53:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:53:04,328][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:53:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:53:05,644][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:53:06,301][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:53:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:53:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:53:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:53:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:53:09,588][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:53:10,246][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:53:10,904][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:53:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:53:12,219][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:53:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:53:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:53:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:53:14,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:53:15,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:53:16,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:53:16,749][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:53:16,750][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:53:18,024][__main__][INFO] - Iteration 644 took 51s (9.28% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 49m 51s. Estimated total time: 14h 21m 45s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 52s. [2026-03-25 23:53:18,027][__main__][INFO] - Starting iteration 644. [2026-03-25 23:53:18,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:53:18,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:53:23,052][__main__][INFO] - Number of regex retries in iteration 644: 0 [2026-03-25 23:53:23,054][__main__][INFO] - agents played in iteration 644 are Alice, Bob [2026-03-25 23:53:23,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:53:23,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:53:23,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:53:23,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:53:24,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:53:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:53:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:53:26,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:53:26,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:53:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:53:28,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:53:28,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:53:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:53:30,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:53:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:53:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:53:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:53:32,867][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:53:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:53:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:53:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:53:35,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:53:36,154][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:53:36,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:53:37,468][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:53:38,126][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:53:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:53:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:53:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:53:40,755][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:53:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:53:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:53:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:53:43,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:53:44,041][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:53:44,699][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:53:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:53:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:53:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:53:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:53:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:53:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:53:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:53:49,957][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:53:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:53:51,271][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:53:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:53:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:53:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:53:53,900][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:53:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:53:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:53:56,157][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:53:56,819][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:53:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:53:58,137][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:53:58,795][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:53:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:54:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:54:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:54:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:54:02,088][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:54:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:54:03,404][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:54:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:54:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:54:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:54:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:54:06,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:54:07,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:54:08,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:54:08,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:54:08,588][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:54:09,853][__main__][INFO] - Iteration 645 took 51s (9.69% Gen, 87.86% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 50m 57s. Estimated total time: 14h 23m 43s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 22s, 500 more iterations: 7h 11m 51s. [2026-03-25 23:54:09,856][__main__][INFO] - Starting iteration 645. [2026-03-25 23:54:09,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:54:09,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:54:15,084][__main__][INFO] - Number of regex retries in iteration 645: 0 [2026-03-25 23:54:15,086][__main__][INFO] - agents played in iteration 645 are Alice, Bob [2026-03-25 23:54:15,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:54:15,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:54:15,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:54:15,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:54:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:54:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:54:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:54:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:54:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:54:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:54:20,341][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:54:20,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:54:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:54:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:54:22,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:54:23,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:54:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:54:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:54:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:54:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:54:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:54:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:54:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:54:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:54:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:54:30,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:54:30,854][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:54:31,511][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:54:32,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:54:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:54:33,482][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:54:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:54:34,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:54:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:54:36,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:54:36,767][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:54:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:54:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:54:38,738][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:54:39,395][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:54:40,052][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:54:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:54:41,367][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:54:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:54:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:54:43,338][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:54:43,995][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:54:44,653][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:54:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:54:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:54:46,625][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:54:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:54:48,177][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:54:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:54:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:54:50,149][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:54:50,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:54:51,463][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:54:52,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:54:52,778][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:54:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:54:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:54:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:54:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:54:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:54:56,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:54:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:54:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:54:58,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:54:59,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:55:00,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:55:00,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:55:00,705][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:55:02,296][__main__][INFO] - Iteration 646 took 52s (9.97% Gen, 86.99% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 0m 20s. Estimated total time: 14h 33m 58s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 59s. [2026-03-25 23:55:02,298][__main__][INFO] - Starting iteration 646. [2026-03-25 23:55:02,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:55:02,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:55:11,579][__main__][INFO] - Number of regex retries in iteration 646: 0 [2026-03-25 23:55:11,581][__main__][INFO] - agents played in iteration 646 are Alice, Bob [2026-03-25 23:55:12,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:55:12,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:55:12,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:55:12,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:55:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:55:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:55:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:55:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:55:15,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:55:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:55:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:55:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:55:18,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:55:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:55:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:55:20,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:55:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:55:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:55:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:55:22,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:55:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:55:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:55:24,790][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:55:25,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:55:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:55:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:55:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:55:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:55:28,731][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:55:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:55:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:55:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:55:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:55:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:55:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:55:33,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:55:33,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:55:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:55:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:55:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:55:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:55:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:55:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:55:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:55:39,244][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:55:39,901][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:55:40,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:55:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:55:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:55:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:55:43,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:55:43,845][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:55:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:55:45,467][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:55:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:55:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:55:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:55:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:55:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:55:49,410][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:55:50,067][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:55:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:55:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:55:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:55:52,696][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:55:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:55:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:55:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:55:55,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:55:56,005][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:55:57,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:55:57,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:55:57,541][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:55:59,088][__main__][INFO] - Iteration 647 took 56s (16.34% Gen, 80.93% Train). Generation: 9s, Training: 45s. Estimated remaining time: 6h 11m 52s. Estimated total time: 15h 46m 27s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 38s, 500 more iterations: 7h 53m 13s. [2026-03-25 23:55:59,092][__main__][INFO] - Starting iteration 647. [2026-03-25 23:55:59,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:55:59,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:56:04,423][__main__][INFO] - Number of regex retries in iteration 647: 0 [2026-03-25 23:56:04,424][__main__][INFO] - agents played in iteration 647 are Alice, Bob [2026-03-25 23:56:04,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:05,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:05,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:56:05,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:56:05,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:56:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:56:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:56:07,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:56:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:56:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:56:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:56:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:56:10,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:56:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:56:12,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:56:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:56:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:56:14,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:56:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:56:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:56:16,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:56:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:56:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:56:18,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:56:18,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:56:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:56:20,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:56:20,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:56:21,401][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:56:22,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:56:22,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:56:23,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:56:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:56:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:56:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:56:26,016][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:56:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:56:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:56:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:56:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:56:29,311][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:56:29,970][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:56:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:56:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:56:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:56:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:56:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:56:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:56:34,584][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:56:35,244][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:56:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:56:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:56:37,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:56:38,145][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:56:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:56:39,464][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:56:40,123][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:56:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:56:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:56:42,101][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:56:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:56:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:56:44,079][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:56:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:56:45,397][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:56:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:56:46,715][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:56:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:56:48,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:56:48,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:56:49,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:56:49,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:56:50,414][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:56:51,862][__main__][INFO] - Iteration 648 took 52s (10.10% Gen, 87.15% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 4m 0s. Estimated total time: 14h 39m 28s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 44s. [2026-03-25 23:56:51,864][__main__][INFO] - Starting iteration 648. [2026-03-25 23:56:51,869][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:56:51,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:56:57,226][__main__][INFO] - Number of regex retries in iteration 648: 0 [2026-03-25 23:56:57,227][__main__][INFO] - agents played in iteration 648 are Alice, Bob [2026-03-25 23:56:57,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:57,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:57,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:56:57,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:56:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:56:59,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:56:59,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:57:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:57:01,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:57:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:57:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:57:03,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:57:03,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:57:04,427][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:57:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:57:05,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:57:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:57:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:57:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:57:08,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:57:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:57:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:57:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:57:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:57:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:57:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:57:12,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:57:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:57:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:57:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:57:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:57:16,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:57:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:57:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:57:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:57:18,901][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:57:19,559][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:57:20,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:57:20,876][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:57:21,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:57:22,191][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:57:22,851][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:57:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:57:24,167][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:57:24,828][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:57:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:57:26,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:57:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:57:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:57:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:57:28,784][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:57:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:57:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:57:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:57:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:57:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:57:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:57:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:57:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:57:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:57:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:57:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:57:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:57:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:57:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:57:38,941][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:57:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:57:40,263][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:57:40,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:57:41,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:57:42,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:57:42,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:57:42,768][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:57:44,217][__main__][INFO] - Iteration 649 took 52s (10.23% Gen, 87.00% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 56m 9s. Estimated total time: 14h 32m 29s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 14s. [2026-03-25 23:57:44,220][__main__][INFO] - Starting iteration 649. [2026-03-25 23:57:44,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:57:44,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:57:49,287][__main__][INFO] - Number of regex retries in iteration 649: 0 [2026-03-25 23:57:49,289][__main__][INFO] - agents played in iteration 649 are Alice, Bob [2026-03-25 23:57:49,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:49,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:49,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:57:49,980][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:57:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:57:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:57:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:57:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:57:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:57:53,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:57:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:57:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:57:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:57:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:57:57,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:57:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:57:58,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:57:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:57:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:58:00,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:58:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:58:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:58:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:58:03,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:58:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:58:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:58:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:58:05,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:58:06,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:58:07,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:58:07,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:58:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:58:09,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:58:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:58:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:58:10,988][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:58:11,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:58:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:58:12,959][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:58:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:58:14,273][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:58:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:58:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:58:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:58:16,902][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:58:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:58:18,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:58:18,874][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:58:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:58:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:58:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:58:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:58:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:58:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:58:23,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:58:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:58:25,116][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:58:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:58:26,432][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:58:27,089][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:58:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:58:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:58:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:58:29,718][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:58:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:58:31,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:58:31,690][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:58:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:58:33,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:58:33,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:58:35,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:58:35,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:58:35,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:58:36,389][__main__][INFO] - Iteration 650 took 52s (9.71% Gen, 87.74% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 52m 13s. Estimated total time: 14h 29m 26s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 43s. [2026-03-25 23:58:36,391][__main__][INFO] - Starting iteration 650. [2026-03-25 23:58:36,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:58:36,395][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:58:41,068][__main__][INFO] - Number of regex retries in iteration 650: 0 [2026-03-25 23:58:41,070][__main__][INFO] - agents played in iteration 650 are Alice, Bob [2026-03-25 23:58:41,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:58:41,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:58:41,728][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:58:41,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:58:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:58:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:58:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:58:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:58:45,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:58:45,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:58:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:58:46,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:58:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:58:48,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:58:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:58:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:58:50,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:58:50,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:58:51,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:58:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:58:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:58:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:58:54,207][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:58:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:58:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:58:56,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:58:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:58:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:58:58,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:58:58,806][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:58:59,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:59:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:59:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:59:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:59:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:59:02,749][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:59:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:59:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:59:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:59:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:59:06,035][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:59:06,693][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:59:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:59:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:59:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:59:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:59:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:59:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:59:11,295][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:59:11,952][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:59:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:59:13,266][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:59:14,220][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:59:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:59:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:59:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:59:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:59:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:59:18,163][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:59:18,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:59:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:59:20,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:59:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:59:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:59:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:59:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:59:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:59:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:59:24,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:59:25,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:59:26,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:59:26,546][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:59:26,547][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:59:29,682][__main__][INFO] - Iteration 651 took 53s (8.77% Gen, 85.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 10m 3s. Estimated total time: 14h 48m 8s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 4s. [2026-03-25 23:59:29,685][__main__][INFO] - Starting iteration 651. [2026-03-25 23:59:29,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-25 23:59:29,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:59:34,642][__main__][INFO] - Number of regex retries in iteration 651: 0 [2026-03-25 23:59:34,643][__main__][INFO] - agents played in iteration 651 are Alice, Bob [2026-03-25 23:59:35,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:59:35,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:59:35,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:59:35,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:59:35,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:59:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:59:37,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:59:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:59:38,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:59:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:59:39,765][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:59:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:59:41,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:59:41,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:59:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:59:43,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:59:43,706][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:59:44,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:59:45,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:59:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:59:46,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:59:46,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:59:47,647][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:59:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:59:48,961][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:59:49,618][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:59:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:59:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:59:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:59:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:59:52,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:59:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:59:54,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:59:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:59:55,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:59:56,187][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:59:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:59:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:59:58,158][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:59:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:59:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:00:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:00:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:00:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:00:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:00:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:00:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:00:04,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:00:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:00:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:00:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:00:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:00:07,598][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:00:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:00:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:00:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:00:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:00:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:00:11,540][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:00:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:00:12,854][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:00:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:00:14,168][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:00:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:00:15,483][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:00:16,142][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:00:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:00:17,456][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:00:18,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:00:18,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 00:00:19,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:00:19,933][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:00:19,934][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:00:21,308][__main__][INFO] - Iteration 652 took 51s (9.60% Gen, 87.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 41m 23s. Estimated total time: 14h 20m 20s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 2s, 500 more iterations: 7h 10m 10s. [2026-03-26 00:00:21,311][__main__][INFO] - Starting iteration 652. [2026-03-26 00:00:21,315][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:00:21,315][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:00:26,168][__main__][INFO] - Number of regex retries in iteration 652: 0 [2026-03-26 00:00:26,169][__main__][INFO] - agents played in iteration 652 are Alice, Bob [2026-03-26 00:00:26,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:00:27,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:00:27,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:00:27,026][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:00:27,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:00:28,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:00:28,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:00:29,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:00:30,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:00:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:00:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:00:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:00:32,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:00:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:00:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:00:34,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:00:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:00:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:00:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:00:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:00:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:00:38,790][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:00:39,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:00:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:00:40,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:00:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:00:42,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:00:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:00:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:00:44,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:00:44,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:00:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:00:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:00:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:00:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:00:47,988][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:00:48,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:00:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:00:49,960][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:00:50,617][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:00:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:00:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:00:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:00:53,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:00:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:00:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:00:55,219][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:00:55,878][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:00:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:00:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:00:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:00:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:00:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:01:00,088][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:01:00,747][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:01:01,405][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:01:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:01:02,722][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:01:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:01:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:01:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:01:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:01:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:01:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:01:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:01:07,991][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:01:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:01:09,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:01:09,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:01:10,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:01:11,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:01:11,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:01:11,790][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:01:13,205][__main__][INFO] - Iteration 653 took 51s (9.35% Gen, 87.91% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 45m 2s. Estimated total time: 14h 24m 51s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 25s. [2026-03-26 00:01:13,208][__main__][INFO] - Starting iteration 653. [2026-03-26 00:01:13,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:01:13,212][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:01:18,074][__main__][INFO] - Number of regex retries in iteration 653: 0 [2026-03-26 00:01:18,076][__main__][INFO] - agents played in iteration 653 are Alice, Bob [2026-03-26 00:01:18,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:01:18,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:01:18,661][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:01:18,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:01:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:01:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:01:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:01:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:01:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:01:22,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:01:23,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:01:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:01:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:01:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:01:25,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:01:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:01:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:01:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:01:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:01:29,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:01:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:01:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:01:31,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:01:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:01:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:01:33,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:01:33,787][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:01:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:01:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:01:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:01:36,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:01:37,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:01:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:01:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:01:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:01:39,716][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:01:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:01:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:01:41,692][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:01:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:01:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:01:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:01:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:01:44,985][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:01:45,644][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:01:46,302][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:01:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:01:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:01:48,279][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:01:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:01:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:01:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:01:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:01:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:01:52,546][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:01:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:01:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:01:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:01:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:01:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:01:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:01:57,159][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:01:57,818][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:01:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:01:59,137][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:01:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:02:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:02:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:02:01,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:02:02,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:02:03,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:02:03,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:02:03,710][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:02:05,226][__main__][INFO] - Iteration 654 took 52s (9.35% Gen, 87.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 46m 14s. Estimated total time: 14h 26m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 28s. [2026-03-26 00:02:05,229][__main__][INFO] - Starting iteration 654. [2026-03-26 00:02:05,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:02:05,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:02:10,381][__main__][INFO] - Number of regex retries in iteration 654: 0 [2026-03-26 00:02:10,382][__main__][INFO] - agents played in iteration 654 are Alice, Bob [2026-03-26 00:02:11,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:02:11,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:02:11,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:02:11,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:02:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:02:12,358][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:02:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:02:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:02:14,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:02:14,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:02:15,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:02:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:02:16,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:02:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:02:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:02:18,942][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:02:19,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:02:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:02:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:02:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:02:22,233][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:02:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:02:23,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:02:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:02:24,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:02:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:02:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:02:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:02:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:02:28,161][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:02:28,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:02:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:02:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:02:30,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:02:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:02:32,113][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:02:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:02:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:02:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:02:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:02:35,406][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:02:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:02:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:02:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:02:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:02:38,700][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:02:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:02:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:02:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:02:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:02:41,993][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:02:42,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:02:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:02:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:02:44,862][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:02:45,521][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:02:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:02:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:02:47,498][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:02:48,157][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:02:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:02:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:02:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:02:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:02:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:02:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:02:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:02:53,428][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:02:54,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:02:54,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:02:56,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:02:56,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:02:56,984][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:02:58,252][__main__][INFO] - Iteration 655 took 53s (9.71% Gen, 87.89% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 2m 6s. Estimated total time: 14h 43m 41s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 22s, 500 more iterations: 7h 21m 50s. [2026-03-26 00:02:58,255][__main__][INFO] - Starting iteration 655. [2026-03-26 00:02:58,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:02:58,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:03:12,986][__main__][INFO] - Number of regex retries in iteration 655: 0 [2026-03-26 00:03:12,988][__main__][INFO] - agents played in iteration 655 are Alice, Bob [2026-03-26 00:03:13,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:03:13,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:03:13,603][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:03:13,604][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:03:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:03:14,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:03:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:03:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:03:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:03:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:03:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:03:18,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:03:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:03:20,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:03:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:03:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:03:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:03:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:03:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:03:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:03:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:03:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:03:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:03:26,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:03:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:03:28,060][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:03:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:03:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:03:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:03:30,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:03:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:03:32,015][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:03:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:03:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:03:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:03:34,648][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:03:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:03:35,966][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:03:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:03:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:03:37,940][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:03:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:03:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:03:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:03:40,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:03:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:03:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:03:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:03:43,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:03:43,864][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:03:44,523][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:03:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:03:46,171][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:03:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:03:47,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:03:48,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:03:48,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:03:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:03:50,120][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:03:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:03:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:03:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:03:52,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:03:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:03:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:03:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:03:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:03:56,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:03:56,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:03:57,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:03:58,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:03:58,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:03:58,592][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:03:59,955][__main__][INFO] - Iteration 656 took 1m 1s (23.87% Gen, 73.92% Train). Generation: 14s, Training: 45s. Estimated remaining time: 7h 25m 42s. Estimated total time: 17h 8m 18s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 49s, 500 more iterations: 8h 34m 9s. [2026-03-26 00:03:59,957][__main__][INFO] - Starting iteration 656. [2026-03-26 00:03:59,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:03:59,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:04:04,918][__main__][INFO] - Number of regex retries in iteration 656: 0 [2026-03-26 00:04:04,920][__main__][INFO] - agents played in iteration 656 are Alice, Bob [2026-03-26 00:04:05,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:05,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:05,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:04:05,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:04:06,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:04:06,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:04:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:04:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:04:08,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:04:09,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:04:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:04:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:04:11,570][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:04:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:04:12,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:04:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:04:14,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:04:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:04:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:04:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:04:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:04:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:04:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:04:18,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:04:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:04:20,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:04:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:04:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:04:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:04:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:04:23,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:04:24,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:04:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:04:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:04:26,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:04:26,721][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:04:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:04:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:04:28,696][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:04:29,354][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:04:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:04:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:04:31,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:04:31,984][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:04:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:04:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:04:33,956][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:04:34,613][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:04:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:04:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:04:36,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:04:37,243][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:04:38,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:04:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:04:39,539][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:04:40,197][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:04:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:04:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:04:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:04:42,830][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:04:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:04:44,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:04:44,805][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:04:45,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:04:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:04:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:04:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:04:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:04:48,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:04:49,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:04:50,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:04:50,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:04:50,653][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:04:51,994][__main__][INFO] - Iteration 657 took 52s (9.53% Gen, 87.90% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 43m 46s. Estimated total time: 14h 27m 14s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 37s. [2026-03-26 00:04:51,997][__main__][INFO] - Starting iteration 657. [2026-03-26 00:04:52,001][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:04:52,002][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:04:57,280][__main__][INFO] - Number of regex retries in iteration 657: 0 [2026-03-26 00:04:57,282][__main__][INFO] - agents played in iteration 657 are Alice, Bob [2026-03-26 00:04:57,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:57,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:57,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:04:57,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:04:58,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:04:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:04:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:05:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:05:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:05:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:05:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:05:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:05:03,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:05:04,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:05:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:05:05,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:05:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:05:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:05:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:05:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:05:09,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:05:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:05:10,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:05:11,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:05:11,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:05:12,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:05:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:05:13,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:05:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:05:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:05:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:05:16,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:05:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:05:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:05:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:05:18,919][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:05:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:05:20,234][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:05:20,891][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:05:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:05:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:05:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:05:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:05:24,178][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:05:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:05:25,494][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:05:26,152][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:05:26,810][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:05:27,467][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:05:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:05:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:05:29,440][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:05:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:05:31,051][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:05:31,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:05:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:05:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:05:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:05:34,340][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:05:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:05:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:05:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:05:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:05:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:05:38,289][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:05:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:05:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:05:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:05:40,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:05:41,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:05:42,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:05:42,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:05:42,879][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:05:44,308][__main__][INFO] - Iteration 658 took 52s (10.09% Gen, 87.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 47m 28s. Estimated total time: 14h 31m 48s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 54s. [2026-03-26 00:05:44,311][__main__][INFO] - Starting iteration 658. [2026-03-26 00:05:44,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:05:44,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:05:50,405][__main__][INFO] - Number of regex retries in iteration 658: 0 [2026-03-26 00:05:50,406][__main__][INFO] - agents played in iteration 658 are Alice, Bob [2026-03-26 00:05:51,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:05:51,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:05:51,266][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:05:51,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:05:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:05:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:05:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:05:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:05:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:05:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:05:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:05:56,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:05:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:05:57,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:05:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:05:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:05:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:06:00,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:06:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:06:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:06:02,381][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:06:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:06:03,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:06:04,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:06:05,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:06:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:06:06,324][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:06:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:06:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:06:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:06:08,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:06:09,610][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:06:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:06:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:06:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:06:12,240][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:06:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:06:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:06:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:06:14,869][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:06:15,527][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:06:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:06:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:06:17,499][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:06:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:06:18,813][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:06:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:06:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:06:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:06:21,443][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:06:22,100][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:06:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:06:23,663][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:06:24,321][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:06:24,979][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:06:25,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:06:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:06:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:06:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:06:28,268][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:06:28,925][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:06:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:06:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:06:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:06:31,556][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:06:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:06:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:06:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:06:34,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:06:34,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:06:36,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:06:36,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:06:36,126][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:06:37,416][__main__][INFO] - Iteration 659 took 53s (11.46% Gen, 86.10% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 59m 49s. Estimated total time: 14h 45m 2s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 30s, 500 more iterations: 7h 22m 31s. [2026-03-26 00:06:37,418][__main__][INFO] - Starting iteration 659. [2026-03-26 00:06:37,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:06:37,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:06:42,583][__main__][INFO] - Number of regex retries in iteration 659: 0 [2026-03-26 00:06:42,586][__main__][INFO] - agents played in iteration 659 are Alice, Bob [2026-03-26 00:06:43,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:06:43,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:06:43,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:06:43,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:06:43,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:06:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:06:45,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:06:45,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:06:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:06:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:06:47,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:06:48,375][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:06:49,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:06:49,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:06:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:06:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:06:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:06:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:06:52,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:06:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:06:54,291][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:06:54,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:06:55,606][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:06:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:06:56,922][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:06:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:06:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:06:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:06:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:07:00,209][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:07:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:07:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:07:02,182][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:07:02,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:07:03,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:07:04,155][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:07:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:07:05,471][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:07:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:07:06,787][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:07:07,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:07:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:07:08,759][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:07:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:07:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:07:10,732][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:07:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:07:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:07:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:07:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:07:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:07:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:07:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:07:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:07:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:07:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:07:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:07:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:07:19,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:07:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:07:20,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:07:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:07:22,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:07:22,837][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:07:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:07:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:07:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:07:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:07:26,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:07:26,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:07:27,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:07:27,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:07:27,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:07:29,083][__main__][INFO] - Iteration 660 took 51s (9.99% Gen, 87.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 34m 58s. Estimated total time: 14h 21m 3s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 31s. [2026-03-26 00:07:29,087][__main__][INFO] - Starting iteration 660. [2026-03-26 00:07:29,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:07:29,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:07:34,160][__main__][INFO] - Number of regex retries in iteration 660: 0 [2026-03-26 00:07:34,161][__main__][INFO] - agents played in iteration 660 are Alice, Bob [2026-03-26 00:07:34,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:07:34,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:07:34,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:07:34,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:07:35,583][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:07:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:07:36,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:07:37,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:07:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:07:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:07:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:07:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:07:40,794][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:07:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:07:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:07:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:07:43,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:07:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:07:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:07:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:07:46,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:07:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:07:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:07:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:07:48,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:07:49,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:07:49,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:07:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:07:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:07:51,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:07:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:07:53,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:07:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:07:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:07:55,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:07:55,919][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:07:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:07:57,235][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:07:57,893][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:07:58,551][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:07:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:07:59,867][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:08:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:08:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:08:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:08:02,497][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:08:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:08:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:08:04,470][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:08:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:08:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:08:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:08:07,391][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:08:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:08:08,708][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:08:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:08:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:08:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:08:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:08:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:08:12,654][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:08:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:08:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:08:14,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:08:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:08:15,943][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:08:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:08:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:08:17,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:08:18,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:08:19,697][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:08:19,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:08:19,701][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:08:21,078][__main__][INFO] - Iteration 661 took 51s (9.75% Gen, 87.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 39m 32s. Estimated total time: 14h 26m 29s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 14s. [2026-03-26 00:08:21,081][__main__][INFO] - Starting iteration 661. [2026-03-26 00:08:21,085][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:08:21,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:08:25,980][__main__][INFO] - Number of regex retries in iteration 661: 0 [2026-03-26 00:08:25,981][__main__][INFO] - agents played in iteration 661 are Alice, Bob [2026-03-26 00:08:26,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:08:26,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:08:26,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:08:26,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:08:27,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:08:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:08:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:08:29,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:08:29,849][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:08:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:08:31,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:08:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:08:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:08:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:08:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:08:34,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:08:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:08:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:08:36,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:08:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:08:37,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:08:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:08:39,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:08:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:08:40,362][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:08:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:08:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:08:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:08:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:08:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:08:44,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:08:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:08:45,624][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:08:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:08:46,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:08:47,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:08:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:08:48,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:08:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:08:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:08:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:08:51,543][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:08:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:08:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:08:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:08:54,174][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:08:54,833][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:08:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:08:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:08:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:08:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:08:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:08:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:08:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:09:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:09:00,994][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:09:01,652][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:09:02,310][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:09:02,968][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:09:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:09:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:09:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:09:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:09:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:09:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:09:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:09:08,233][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:09:08,891][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:09:09,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:09:10,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:09:11,504][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:09:11,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:09:11,508][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:09:13,022][__main__][INFO] - Iteration 662 took 51s (9.43% Gen, 87.66% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 37m 49s. Estimated total time: 14h 25m 38s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 49s. [2026-03-26 00:09:13,024][__main__][INFO] - Starting iteration 662. [2026-03-26 00:09:13,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:09:13,030][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:09:18,727][__main__][INFO] - Number of regex retries in iteration 662: 0 [2026-03-26 00:09:18,729][__main__][INFO] - agents played in iteration 662 are Alice, Bob [2026-03-26 00:09:19,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:09:19,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:09:19,520][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:09:19,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:09:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:09:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:09:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:09:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:09:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:09:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:09:24,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:09:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:09:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:09:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:09:26,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:09:27,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:09:28,010][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:09:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:09:29,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:09:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:09:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:09:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:09:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:09:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:09:33,265][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:09:33,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:09:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:09:35,236][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:09:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:09:36,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:09:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:09:37,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:09:38,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:09:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:09:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:09:40,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:09:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:09:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:09:42,473][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:09:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:09:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:09:44,448][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:09:45,106][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:09:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:09:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:09:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:09:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:09:48,396][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:09:49,054][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:09:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:09:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:09:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:09:51,943][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:09:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:09:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:09:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:09:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:09:55,235][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:09:55,893][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:09:56,551][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:09:57,210][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:09:57,869][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:09:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:09:59,186][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:09:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:10:00,503][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:10:01,161][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:10:01,819][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:10:02,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:10:03,338][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:10:04,432][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:10:04,434][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:10:04,436][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:10:05,635][__main__][INFO] - Iteration 663 took 52s (10.83% Gen, 86.88% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 48m 6s. Estimated total time: 14h 36m 48s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 40s, 500 more iterations: 7h 18m 24s. [2026-03-26 00:10:05,638][__main__][INFO] - Starting iteration 663. [2026-03-26 00:10:05,642][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:10:05,643][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:10:10,747][__main__][INFO] - Number of regex retries in iteration 663: 0 [2026-03-26 00:10:10,749][__main__][INFO] - agents played in iteration 663 are Alice, Bob [2026-03-26 00:10:11,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:11,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:11,314][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:10:11,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:10:11,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:10:12,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:10:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:10:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:10:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:10:15,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:10:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:10:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:10:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:10:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:10:18,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:10:19,164][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:10:19,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:10:20,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:10:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:10:21,796][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:10:22,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:10:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:10:23,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:10:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:10:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:10:25,744][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:10:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:10:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:10:27,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:10:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:10:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:10:29,693][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:10:30,351][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:10:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:10:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:10:32,326][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:10:32,984][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:10:33,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:10:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:10:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:10:35,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:10:36,288][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:10:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:10:37,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:10:38,267][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:10:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:10:39,585][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:10:40,244][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:10:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:10:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:10:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:10:42,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:10:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:10:44,569][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:10:45,229][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:10:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:10:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:10:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:10:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:10:48,530][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:10:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:10:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:10:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:10:51,170][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:10:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:10:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:10:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:10:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:10:54,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:10:55,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:10:56,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:10:56,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:10:56,383][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:10:57,732][__main__][INFO] - Iteration 664 took 52s (9.81% Gen, 87.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 38m 39s. Estimated total time: 14h 28m 12s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 6s. [2026-03-26 00:10:57,735][__main__][INFO] - Starting iteration 664. [2026-03-26 00:10:57,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:10:57,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:11:02,465][__main__][INFO] - Number of regex retries in iteration 664: 0 [2026-03-26 00:11:02,466][__main__][INFO] - agents played in iteration 664 are Alice, Bob [2026-03-26 00:11:03,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:03,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:03,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:11:03,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:11:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:11:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:11:05,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:11:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:11:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:11:07,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:11:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:11:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:11:08,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:11:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:11:10,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:11:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:11:11,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:11:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:11:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:11:13,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:11:14,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:11:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:11:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:11:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:11:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:11:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:11:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:11:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:11:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:11:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:11:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:11:21,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:11:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:11:22,831][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:11:23,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:11:24,148][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:11:24,807][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:11:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:11:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:11:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:11:27,443][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:11:28,101][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:11:28,760][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:11:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:11:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:11:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:11:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:11:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:11:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:11:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:11:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:11:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:11:35,689][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:11:36,349][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:11:37,008][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:11:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:11:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:11:38,986][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:11:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:11:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:11:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:11:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:11:42,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:11:42,936][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:11:43,595][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:11:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:11:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:11:45,568][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:11:46,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:11:47,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:11:48,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:11:48,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:11:48,170][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:11:49,435][__main__][INFO] - Iteration 665 took 51s (9.14% Gen, 88.40% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 31m 13s. Estimated total time: 14h 21m 38s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 9s, 500 more iterations: 7h 10m 49s. [2026-03-26 00:11:49,439][__main__][INFO] - Starting iteration 665. [2026-03-26 00:11:49,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:11:49,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:11:54,162][__main__][INFO] - Number of regex retries in iteration 665: 0 [2026-03-26 00:11:54,163][__main__][INFO] - agents played in iteration 665 are Alice, Bob [2026-03-26 00:11:54,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:54,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:54,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:11:54,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:11:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:11:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:11:56,760][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:11:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:11:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:11:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:11:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:12:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:12:00,702][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:12:01,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:12:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:12:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:12:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:12:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:05,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:06,618][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:12:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:12:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:12:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:12:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:12:11,221][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:12:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:12:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:12:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:12:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:12:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:12:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:12:15,824][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:12:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:12:17,140][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:12:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:12:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:12:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:12:19,770][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:12:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:12:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:12:21,743][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:12:22,400][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:12:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:12:23,715][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:12:24,373][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:12:25,031][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:12:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:12:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:12:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:12:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:12:28,648][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:12:29,308][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:12:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:12:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:12:31,278][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:12:31,936][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:12:32,594][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:12:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:12:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:12:34,567][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:12:35,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:12:35,882][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:12:36,539][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:12:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:12:37,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:12:38,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:12:39,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:12:39,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:12:39,788][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:12:41,129][__main__][INFO] - Iteration 666 took 51s (9.13% Gen, 88.27% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 30m 11s. Estimated total time: 14h 21m 28s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 8s, 500 more iterations: 7h 10m 44s. [2026-03-26 00:12:41,131][__main__][INFO] - Starting iteration 666. [2026-03-26 00:12:41,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:12:41,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:12:45,973][__main__][INFO] - Number of regex retries in iteration 666: 0 [2026-03-26 00:12:45,974][__main__][INFO] - agents played in iteration 666 are Alice, Bob [2026-03-26 00:12:46,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:12:46,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:12:46,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:12:46,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:12:47,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:12:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:12:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:12:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:12:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:12:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:12:51,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:12:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:12:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:12:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:12:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:12:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:12:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:12:55,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:56,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:57,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:57,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:58,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:59,102][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:13:00,420][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:01,079][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:02,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:13:04,374][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:13:05,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:13:05,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:13:06,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:13:07,009][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:13:07,669][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:13:08,328][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:13:08,987][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:13:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:13:10,304][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:13:10,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:13:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:13:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:13:12,940][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:13:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:13:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:13:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:13:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:13:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:13:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:13:17,552][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:13:18,211][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:13:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:13:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:13:20,452][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:13:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:13:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:13:22,430][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:13:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:13:23,749][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:13:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:13:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:13:25,727][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:13:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:13:27,046][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:13:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:13:28,364][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:13:29,024][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:13:29,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:13:30,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:13:31,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:13:31,714][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:13:31,717][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:13:32,963][__main__][INFO] - Iteration 667 took 51s (9.34% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 31m 41s. Estimated total time: 14h 23m 50s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 55s. [2026-03-26 00:13:32,967][__main__][INFO] - Starting iteration 667. [2026-03-26 00:13:32,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:13:32,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:13:38,036][__main__][INFO] - Number of regex retries in iteration 667: 0 [2026-03-26 00:13:38,038][__main__][INFO] - agents played in iteration 667 are Alice, Bob [2026-03-26 00:13:38,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:13:38,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:13:38,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:13:38,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:13:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:13:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:13:40,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:13:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:13:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:13:42,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:13:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:13:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:13:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:13:45,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:13:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:13:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:13:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:13:47,877][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:13:48,536][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:13:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:13:49,854][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:13:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:13:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:13:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:13:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:55,126][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:13:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:13:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:13:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:13:58,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:13:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:13:59,739][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:14:00,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:14:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:14:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:14:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:14:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:14:03,692][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:14:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:14:05,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:14:05,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:14:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:14:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:14:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:14:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:14:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:14:09,622][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:14:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:14:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:14:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:14:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:14:13,196][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:14:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:14:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:14:15,173][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:14:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:14:16,491][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:14:17,150][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:14:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:14:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:14:19,126][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:14:19,785][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:14:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:14:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:14:21,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:14:22,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:14:23,714][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:14:23,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:14:23,721][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:14:25,040][__main__][INFO] - Iteration 668 took 52s (9.73% Gen, 87.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 34m 49s. Estimated total time: 14h 27m 50s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 55s. [2026-03-26 00:14:25,042][__main__][INFO] - Starting iteration 668. [2026-03-26 00:14:25,045][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:14:25,046][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:14:29,732][__main__][INFO] - Number of regex retries in iteration 668: 0 [2026-03-26 00:14:29,733][__main__][INFO] - agents played in iteration 668 are Alice, Bob [2026-03-26 00:14:30,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:14:30,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:14:30,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:14:30,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:14:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:14:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:14:32,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:14:32,957][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:14:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:14:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:14:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:14:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:14:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:14:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:14:37,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:14:38,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:14:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:14:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:14:40,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:14:40,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:14:41,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:14:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:14:42,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:14:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:14:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:14:44,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:14:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:14:46,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:14:46,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:14:47,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:14:48,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:14:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:14:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:14:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:14:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:14:51,403][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:14:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:14:52,721][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:14:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:14:54,039][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:14:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:14:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:14:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:14:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:14:57,333][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:14:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:14:58,651][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:14:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:14:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:15:00,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:15:01,286][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:15:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:15:02,839][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:15:03,500][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:15:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:15:04,818][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:15:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:15:06,136][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:15:06,795][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:15:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:15:08,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:15:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:15:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:15:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:15:10,751][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:15:11,410][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:15:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:15:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:15:13,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:15:14,155][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:15:15,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:15:15,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:15:15,429][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:15:16,823][__main__][INFO] - Iteration 669 took 51s (9.05% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 29m 6s. Estimated total time: 14h 22m 59s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 17s, 500 more iterations: 7h 11m 29s. [2026-03-26 00:15:16,826][__main__][INFO] - Starting iteration 669. [2026-03-26 00:15:16,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:15:16,830][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:15:23,305][__main__][INFO] - Number of regex retries in iteration 669: 0 [2026-03-26 00:15:23,306][__main__][INFO] - agents played in iteration 669 are Alice, Bob [2026-03-26 00:15:23,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:15:23,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:15:23,959][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:15:23,960][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:15:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:15:25,243][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:15:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:15:26,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:15:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:15:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:15:28,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:15:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:15:29,856][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:15:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:15:31,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:15:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:15:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:15:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:15:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:15:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:15:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:15:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:15:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:15:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:15:37,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:15:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:15:39,075][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:15:39,734][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:15:40,392][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:15:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:15:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:15:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:15:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:15:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:15:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:15:45,003][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:15:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:15:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:15:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:15:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:15:48,297][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:15:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:15:49,614][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:15:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:15:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:15:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:15:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:15:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:15:53,566][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:15:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:15:54,884][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:15:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:15:56,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:15:57,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:15:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:15:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:15:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:15:59,780][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:16:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:16:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:16:01,756][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:16:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:16:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:16:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:16:04,392][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:16:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:16:05,710][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:16:06,369][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:16:07,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:16:07,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:16:08,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:16:08,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:16:08,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:16:10,128][__main__][INFO] - Iteration 670 took 53s (12.15% Gen, 85.58% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 53m 34s. Estimated total time: 14h 48m 20s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 50s, 500 more iterations: 7h 24m 10s. [2026-03-26 00:16:10,131][__main__][INFO] - Starting iteration 670. [2026-03-26 00:16:10,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:16:10,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:16:15,248][__main__][INFO] - Number of regex retries in iteration 670: 0 [2026-03-26 00:16:15,249][__main__][INFO] - agents played in iteration 670 are Alice, Bob [2026-03-26 00:16:16,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:16:16,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:16:16,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:16:16,086][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:16:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:16:17,353][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:16:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:16:18,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:16:19,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:16:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:16:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:16:21,306][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:16:21,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:16:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:16:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:16:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:16:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:16:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:16:25,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:16:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:16:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:16:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:16:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:16:29,219][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:16:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:16:30,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:16:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:16:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:16:32,517][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:16:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:16:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:16:34,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:16:35,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:16:35,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:16:36,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:16:37,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:16:37,794][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:16:38,452][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:16:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:16:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:16:40,431][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:16:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:16:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:16:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:16:43,069][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:16:43,729][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:16:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:16:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:16:45,708][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:16:46,368][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:16:47,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:16:47,687][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:16:48,584][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:16:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:16:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:16:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:16:51,223][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:16:51,883][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:16:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:16:53,203][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:16:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:16:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:16:55,182][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:16:55,842][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:16:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:16:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:16:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:16:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:16:59,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:16:59,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:17:01,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:17:01,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:17:01,027][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:17:02,305][__main__][INFO] - Iteration 671 took 52s (9.80% Gen, 87.75% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 33m 53s. Estimated total time: 14h 29m 31s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 45s. [2026-03-26 00:17:02,307][__main__][INFO] - Starting iteration 671. [2026-03-26 00:17:02,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:17:02,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:17:11,874][__main__][INFO] - Number of regex retries in iteration 671: 0 [2026-03-26 00:17:11,876][__main__][INFO] - agents played in iteration 671 are Alice, Bob [2026-03-26 00:17:12,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:17:12,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:17:12,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:17:12,628][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:17:13,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:17:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:17:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:17:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:17:15,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:17:16,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:17:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:17:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:17:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:17:19,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:17:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:17:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:17:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:17:21,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:17:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:17:23,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:17:23,782][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:17:24,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:17:25,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:17:25,763][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:17:26,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:17:27,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:17:27,741][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:17:28,400][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:17:29,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:17:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:17:30,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:17:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:17:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:17:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:17:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:17:33,677][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:17:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:17:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:17:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:17:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:17:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:17:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:17:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:17:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:17:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:17:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:17:40,932][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:17:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:17:42,251][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:17:42,910][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:17:43,570][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:17:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:17:45,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:17:45,835][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:17:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:17:47,154][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:17:47,814][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:17:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:17:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:17:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:17:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:17:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:17:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:17:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:17:53,093][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:17:53,753][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:17:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:17:55,074][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:17:55,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:17:56,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:17:57,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:17:57,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:17:57,674][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:17:58,896][__main__][INFO] - Iteration 672 took 56s (16.90% Gen, 80.94% Train). Generation: 9s, Training: 45s. Estimated remaining time: 5h 46m 31s. Estimated total time: 15h 43m 5s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 18s, 500 more iterations: 7h 51m 32s. [2026-03-26 00:17:58,899][__main__][INFO] - Starting iteration 672. [2026-03-26 00:17:58,902][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:17:58,903][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:18:03,935][__main__][INFO] - Number of regex retries in iteration 672: 0 [2026-03-26 00:18:03,937][__main__][INFO] - agents played in iteration 672 are Alice, Bob [2026-03-26 00:18:04,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:04,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:04,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:18:04,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:18:05,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:18:05,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:18:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:18:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:18:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:18:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:18:09,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:18:09,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:18:10,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:18:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:18:11,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:18:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:18:13,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:18:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:18:14,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:18:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:18:15,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:18:16,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:18:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:18:17,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:18:18,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:18:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:18:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:18:20,337][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:18:20,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:18:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:18:22,313][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:18:22,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:18:23,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:18:24,290][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:18:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:18:25,607][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:18:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:18:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:18:27,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:18:28,243][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:18:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:18:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:18:30,219][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:18:30,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:18:31,536][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:18:32,196][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:18:32,854][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:18:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:18:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:18:34,830][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:18:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:18:36,148][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:18:37,037][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:18:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:18:38,357][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:18:39,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:18:39,674][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:18:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:18:40,992][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:18:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:18:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:18:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:18:43,629][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:18:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:18:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:18:45,604][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:18:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:18:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:18:47,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:18:48,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:18:49,363][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:18:49,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:18:49,369][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:18:50,817][__main__][INFO] - Iteration 673 took 51s (9.70% Gen, 87.51% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 27m 49s. Estimated total time: 14h 25m 16s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 38s. [2026-03-26 00:18:50,830][__main__][INFO] - Starting iteration 673. [2026-03-26 00:18:50,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:18:50,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:18:56,070][__main__][INFO] - Number of regex retries in iteration 673: 0 [2026-03-26 00:18:56,071][__main__][INFO] - agents played in iteration 673 are Alice, Bob [2026-03-26 00:18:56,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:56,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:56,861][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:18:56,862][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:18:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:18:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:18:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:18:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:19:00,140][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:19:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:19:01,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:19:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:19:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:19:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:19:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:19:04,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:19:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:19:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:19:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:19:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:19:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:19:08,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:19:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:19:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:19:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:19:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:19:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:19:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:19:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:19:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:19:14,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:19:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:19:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:19:16,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:19:17,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:19:17,923][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:19:18,582][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:19:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:19:19,899][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:19:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:19:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:19:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:19:22,535][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:19:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:19:23,852][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:19:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:19:25,170][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:19:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:19:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:19:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:19:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:19:28,466][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:19:29,426][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:19:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:19:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:19:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:19:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:19:32,722][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:19:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:19:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:19:34,700][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:19:35,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:19:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:19:36,677][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:19:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:19:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:19:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:19:39,314][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:19:39,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:19:40,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:19:41,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:19:41,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:19:41,840][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:19:43,117][__main__][INFO] - Iteration 674 took 52s (10.02% Gen, 87.54% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 33m 6s. Estimated total time: 14h 31m 25s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 42s. [2026-03-26 00:19:43,120][__main__][INFO] - Starting iteration 674. [2026-03-26 00:19:43,124][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:19:43,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:19:47,900][__main__][INFO] - Number of regex retries in iteration 674: 0 [2026-03-26 00:19:47,901][__main__][INFO] - agents played in iteration 674 are Alice, Bob [2026-03-26 00:19:48,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:48,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:48,479][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:19:48,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:19:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:19:49,787][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:19:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:19:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:19:51,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:19:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:19:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:19:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:19:54,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:19:55,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:19:55,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:19:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:19:57,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:19:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:19:58,348][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:19:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:19:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:20:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:20:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:20:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:20:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:20:02,957][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:20:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:20:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:20:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:20:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:20:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:20:06,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:20:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:20:08,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:20:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:20:09,548][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:20:10,208][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:20:10,867][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:20:11,526][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:20:12,185][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:20:12,844][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:20:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:20:14,163][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:20:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:20:15,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:20:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:20:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:20:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:20:18,118][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:20:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:20:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:20:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:20:20,986][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:20:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:20:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:20:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:20:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:20:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:20:24,943][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:20:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:20:26,261][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:20:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:20:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:20:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:20:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:20:29,556][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:20:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:20:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:20:31,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:20:32,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:20:33,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:20:33,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:20:33,459][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:20:34,851][__main__][INFO] - Iteration 675 took 51s (9.23% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 22m 57s. Estimated total time: 14h 22m 8s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 12s, 500 more iterations: 7h 11m 4s. [2026-03-26 00:20:34,853][__main__][INFO] - Starting iteration 675. [2026-03-26 00:20:34,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:20:34,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:20:39,895][__main__][INFO] - Number of regex retries in iteration 675: 0 [2026-03-26 00:20:39,897][__main__][INFO] - agents played in iteration 675 are Alice, Bob [2026-03-26 00:20:40,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:20:40,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:20:40,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:20:40,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:20:41,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:20:41,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:20:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:20:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:20:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:20:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:20:45,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:20:45,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:20:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:20:47,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:20:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:20:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:20:49,150][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:20:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:20:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:20:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:20:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:20:52,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:20:53,100][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:20:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:20:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:20:55,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:20:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:20:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:20:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:20:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:20:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:20:59,031][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:20:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:21:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:21:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:21:01,667][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:21:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:21:02,986][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:21:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:21:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:21:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:21:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:21:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:21:06,940][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:21:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:21:08,258][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:21:08,918][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:21:09,578][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:21:10,237][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:21:10,896][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:21:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:21:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:21:13,167][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:21:13,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:21:14,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:21:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:21:15,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:21:16,465][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:21:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:21:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:21:18,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:21:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:21:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:21:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:21:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:21:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:21:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:21:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:21:23,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:21:24,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:21:25,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:21:25,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:21:25,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:21:26,723][__main__][INFO] - Iteration 676 took 51s (9.71% Gen, 87.97% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 24m 24s. Estimated total time: 14h 24m 27s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 26s, 500 more iterations: 7h 12m 13s. [2026-03-26 00:21:26,725][__main__][INFO] - Starting iteration 676. [2026-03-26 00:21:26,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:21:26,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:21:31,477][__main__][INFO] - Number of regex retries in iteration 676: 0 [2026-03-26 00:21:31,478][__main__][INFO] - agents played in iteration 676 are Alice, Bob [2026-03-26 00:21:31,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:21:32,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:21:32,026][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:21:32,026][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:21:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:21:33,341][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:21:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:21:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:21:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:21:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:21:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:21:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:21:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:21:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:21:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:21:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:21:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:21:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:21:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:21:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:21:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:21:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:21:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:21:45,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:21:45,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:21:46,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:21:47,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:21:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:21:48,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:21:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:21:49,802][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:21:50,461][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:21:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:21:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:21:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:21:53,101][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:21:53,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:21:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:21:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:21:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:21:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:21:57,054][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:21:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:21:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:21:59,031][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:21:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:22:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:22:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:22:01,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:22:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:22:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:22:03,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:22:04,636][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:22:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:22:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:22:06,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:22:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:22:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:22:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:22:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:22:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:22:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:22:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:22:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:22:12,548][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:22:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:22:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:22:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:22:15,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:22:16,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:22:17,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:22:17,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:22:17,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:22:18,592][__main__][INFO] - Iteration 677 took 51s (9.15% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 23m 30s. Estimated total time: 14h 24m 25s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 26s, 500 more iterations: 7h 12m 12s. [2026-03-26 00:22:18,595][__main__][INFO] - Starting iteration 677. [2026-03-26 00:22:18,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:22:18,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:22:23,473][__main__][INFO] - Number of regex retries in iteration 677: 0 [2026-03-26 00:22:23,475][__main__][INFO] - agents played in iteration 677 are Alice, Bob [2026-03-26 00:22:24,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:22:24,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:22:24,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:22:24,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:22:24,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:22:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:22:26,090][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:22:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:22:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:22:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:22:28,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:22:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:22:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:22:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:22:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:22:32,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:22:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:22:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:22:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:22:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:22:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:22:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:22:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:22:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:22:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:22:38,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:22:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:22:39,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:22:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:22:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:22:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:22:42,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:22:43,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:22:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:22:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:22:45,189][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:22:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:22:46,506][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:22:47,165][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:22:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:22:48,482][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:22:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:22:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:22:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:22:51,118][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:22:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:22:52,437][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:22:53,096][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:22:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:22:54,420][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:22:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:22:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:22:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:22:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:22:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:22:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:22:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:23:00,037][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:23:00,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:23:01,356][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:23:02,015][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:23:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:23:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:23:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:23:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:23:05,313][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:23:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:23:06,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:23:07,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:23:08,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:23:09,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:23:09,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:23:09,375][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:23:10,829][__main__][INFO] - Iteration 678 took 52s (9.34% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 28m 46s. Estimated total time: 14h 30m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 16s. [2026-03-26 00:23:10,832][__main__][INFO] - Starting iteration 678. [2026-03-26 00:23:10,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:23:10,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:23:16,169][__main__][INFO] - Number of regex retries in iteration 678: 0 [2026-03-26 00:23:16,170][__main__][INFO] - agents played in iteration 678 are Alice, Bob [2026-03-26 00:23:16,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:23:16,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:23:16,769][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:23:16,770][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:23:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:23:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:23:18,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:23:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:23:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:23:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:23:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:23:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:23:22,644][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:23:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:23:23,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:23:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:23:25,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:23:25,940][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:23:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:23:27,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:23:27,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:23:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:23:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:23:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:23:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:23:31,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:23:31,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:23:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:23:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:23:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:23:34,516][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:23:35,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:23:35,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:23:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:23:37,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:23:37,811][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:23:38,471][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:23:39,131][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:23:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:23:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:23:41,107][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:23:41,767][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:23:42,426][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:23:43,085][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:23:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:23:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:23:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:23:45,723][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:23:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:23:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:23:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:23:48,360][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:23:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:23:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:23:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:23:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:23:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:23:52,645][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:23:53,305][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:23:53,964][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:23:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:23:55,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:23:55,943][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:23:56,602][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:23:57,262][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:23:57,921][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:23:58,580][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:23:59,240][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:23:59,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:24:00,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:24:01,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:24:01,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:24:01,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:24:03,195][__main__][INFO] - Iteration 679 took 52s (10.19% Gen, 87.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 30m 1s. Estimated total time: 14h 32m 41s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 20s. [2026-03-26 00:24:03,198][__main__][INFO] - Starting iteration 679. [2026-03-26 00:24:03,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:24:03,203][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:24:08,208][__main__][INFO] - Number of regex retries in iteration 679: 0 [2026-03-26 00:24:08,209][__main__][INFO] - agents played in iteration 679 are Alice, Bob [2026-03-26 00:24:08,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:24:08,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:24:08,896][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:24:08,897][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:24:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:24:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:24:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:24:11,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:24:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:24:12,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:24:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:24:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:24:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:24:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:24:16,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:24:16,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:24:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:24:18,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:24:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:24:19,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:24:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:24:20,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:24:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:24:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:24:22,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:24:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:24:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:24:24,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:24:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:24:25,985][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:24:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:24:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:24:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:24:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:24:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:24:29,941][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:24:30,601][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:24:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:24:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:24:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:24:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:24:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:24:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:24:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:24:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:24:36,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:24:37,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:24:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:24:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:24:39,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:24:39,835][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:24:40,494][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:24:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:24:42,078][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:24:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:24:43,400][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:24:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:24:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:24:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:24:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:24:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:24:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:24:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:24:48,681][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:24:49,340][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:24:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:24:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:24:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:24:51,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:24:52,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:24:54,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:24:54,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:24:54,067][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:24:55,310][__main__][INFO] - Iteration 680 took 52s (9.61% Gen, 88.00% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 24m 58s. Estimated total time: 14h 28m 29s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 14s. [2026-03-26 00:24:55,313][__main__][INFO] - Starting iteration 680. [2026-03-26 00:24:55,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:24:55,318][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:25:00,583][__main__][INFO] - Number of regex retries in iteration 680: 0 [2026-03-26 00:25:00,585][__main__][INFO] - agents played in iteration 680 are Alice, Bob [2026-03-26 00:25:01,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:01,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:01,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:25:01,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:25:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:25:02,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:25:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:25:03,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:25:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:25:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:25:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:25:06,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:25:07,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:25:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:25:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:25:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:25:09,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:25:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:25:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:25:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:25:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:25:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:25:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:25:14,290][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:25:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:25:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:25:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:25:16,928][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:25:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:25:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:25:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:25:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:25:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:25:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:25:21,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:25:22,202][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:25:22,861][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:25:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:25:24,265][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:25:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:25:25,585][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:25:26,244][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:25:26,903][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:25:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:25:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:25:28,880][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:25:29,539][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:25:30,198][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:25:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:25:31,516][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:25:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:25:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:25:33,838][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:25:34,499][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:25:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:25:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:25:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:25:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:25:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:25:38,454][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:25:39,113][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:25:39,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:25:40,431][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:25:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:25:41,748][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:25:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:25:43,067][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:25:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:25:44,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:25:45,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:25:46,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:25:46,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:25:46,488][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:25:47,904][__main__][INFO] - Iteration 681 took 52s (10.01% Gen, 87.29% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 32m 5s. Estimated total time: 14h 36m 28s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 38s, 500 more iterations: 7h 18m 14s. [2026-03-26 00:25:47,913][__main__][INFO] - Starting iteration 681. [2026-03-26 00:25:47,928][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:25:47,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:25:52,902][__main__][INFO] - Number of regex retries in iteration 681: 0 [2026-03-26 00:25:52,904][__main__][INFO] - agents played in iteration 681 are Alice, Bob [2026-03-26 00:25:53,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:53,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:53,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:25:53,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:25:54,253][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:25:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:25:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:25:56,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:25:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:25:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:25:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:25:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:25:59,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:26:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:26:00,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:26:01,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:26:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:26:02,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:26:03,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:26:04,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:26:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:26:05,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:26:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:26:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:26:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:26:08,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:26:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:26:09,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:26:10,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:26:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:26:11,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:26:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:26:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:26:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:26:13,962][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:26:14,621][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:26:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:26:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:26:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:26:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:26:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:26:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:26:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:26:19,892][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:26:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:26:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:26:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:26:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:26:23,190][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:26:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:26:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:26:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:26:26,151][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:26:26,812][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:26:27,471][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:26:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:26:28,789][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:26:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:26:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:26:30,766][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:26:31,425][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:26:32,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:26:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:26:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:26:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:26:34,720][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:26:35,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:26:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:26:36,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:26:37,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:26:38,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:26:38,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:26:38,773][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:26:40,074][__main__][INFO] - Iteration 682 took 52s (9.54% Gen, 87.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 23m 52s. Estimated total time: 14h 29m 8s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 34s. [2026-03-26 00:26:40,077][__main__][INFO] - Starting iteration 682. [2026-03-26 00:26:40,099][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:26:40,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:26:44,849][__main__][INFO] - Number of regex retries in iteration 682: 0 [2026-03-26 00:26:44,851][__main__][INFO] - agents played in iteration 682 are Alice, Bob [2026-03-26 00:26:45,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:26:45,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:26:45,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:26:45,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:26:46,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:26:46,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:26:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:26:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:26:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:26:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:26:49,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:26:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:26:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:26:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:26:52,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:26:53,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:26:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:26:54,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:26:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:26:55,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:26:56,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:26:57,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:26:57,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:26:58,561][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:26:59,220][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:26:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:27:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:27:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:27:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:27:02,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:27:03,172][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:27:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:27:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:27:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:27:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:27:06,466][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:27:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:27:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:27:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:27:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:27:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:27:10,420][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:27:11,078][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:27:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:27:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:27:13,055][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:27:13,714][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:27:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:27:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:27:15,692][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:27:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:27:17,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:27:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:27:18,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:27:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:27:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:27:20,601][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:27:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:27:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:27:22,578][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:27:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:27:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:27:24,556][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:27:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:27:25,875][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:27:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:27:27,194][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:27:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:27:28,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:27:29,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:27:30,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:27:30,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:27:30,443][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:27:31,793][__main__][INFO] - Iteration 683 took 51s (9.19% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 15m 29s. Estimated total time: 14h 21m 36s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 9s, 500 more iterations: 7h 10m 48s. [2026-03-26 00:27:31,796][__main__][INFO] - Starting iteration 683. [2026-03-26 00:27:31,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:27:31,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:27:36,608][__main__][INFO] - Number of regex retries in iteration 683: 0 [2026-03-26 00:27:36,609][__main__][INFO] - agents played in iteration 683 are Alice, Bob [2026-03-26 00:27:37,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:27:37,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:27:37,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:27:37,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:27:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:27:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:27:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:27:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:27:40,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:27:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:27:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:27:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:27:43,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:27:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:27:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:27:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:27:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:27:46,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:27:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:27:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:27:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:27:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:27:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:27:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:27:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:27:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:27:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:27:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:27:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:27:54,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:27:55,017][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:27:55,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:27:56,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:27:56,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:27:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:27:58,311][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:27:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:27:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:28:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:28:00,946][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:28:01,604][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:28:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:28:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:28:03,582][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:28:04,240][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:28:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:28:05,557][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:28:06,216][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:28:06,875][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:28:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:28:08,194][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:28:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:28:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:28:10,408][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:28:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:28:11,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:28:12,384][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:28:13,043][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:28:13,702][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:28:14,361][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:28:15,020][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:28:15,679][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:28:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:28:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:28:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:28:18,316][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:28:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:28:19,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:28:20,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:28:20,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:28:22,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:28:22,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:28:22,125][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:28:23,586][__main__][INFO] - Iteration 684 took 51s (9.29% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 16m 8s. Estimated total time: 14h 23m 8s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 34s. [2026-03-26 00:28:23,589][__main__][INFO] - Starting iteration 684. [2026-03-26 00:28:23,593][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:28:23,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:28:28,594][__main__][INFO] - Number of regex retries in iteration 684: 0 [2026-03-26 00:28:28,595][__main__][INFO] - agents played in iteration 684 are Alice, Bob [2026-03-26 00:28:29,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:28:29,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:28:29,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:28:29,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:28:29,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:28:30,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:28:31,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:28:31,778][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:28:32,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:28:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:28:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:28:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:28:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:28:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:28:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:28:37,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:28:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:28:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:28:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:28:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:28:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:28:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:28:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:28:42,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:28:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:28:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:28:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:28:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:28:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:28:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:28:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:28:47,584][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:28:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:28:48,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:28:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:28:50,219][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:28:50,878][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:28:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:28:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:28:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:28:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:28:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:28:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:28:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:28:56,148][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:28:56,806][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:28:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:28:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:28:58,783][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:28:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:29:00,100][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:29:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:29:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:29:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:29:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:29:03,701][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:29:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:29:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:29:05,678][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:29:06,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:29:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:29:07,653][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:29:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:29:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:29:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:29:10,289][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:29:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:29:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:29:12,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:29:13,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:29:14,128][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:29:14,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:29:14,131][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:29:15,548][__main__][INFO] - Iteration 685 took 51s (9.63% Gen, 87.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 18m 5s. Estimated total time: 14h 25m 57s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 58s. [2026-03-26 00:29:15,551][__main__][INFO] - Starting iteration 685. [2026-03-26 00:29:15,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:29:15,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:29:20,481][__main__][INFO] - Number of regex retries in iteration 685: 0 [2026-03-26 00:29:20,483][__main__][INFO] - agents played in iteration 685 are Alice, Bob [2026-03-26 00:29:21,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:29:21,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:29:21,216][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:29:21,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:29:21,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:29:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:29:23,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:29:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:29:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:29:25,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:29:25,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:29:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:29:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:29:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:29:28,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:29:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:29:29,729][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:29:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:29:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:29:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:29:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:29:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:29:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:29:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:29:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:29:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:29:36,315][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:29:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:29:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:29:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:29:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:29:39,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:29:40,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:29:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:29:41,587][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:29:42,246][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:29:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:29:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:29:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:29:44,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:29:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:29:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:29:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:29:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:29:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:29:48,837][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:29:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:29:50,154][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:29:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:29:51,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:29:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:29:52,791][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:29:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:29:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:29:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:29:55,671][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:29:56,330][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:29:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:29:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:29:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:29:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:29:59,630][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:30:00,289][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:30:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:30:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:30:02,271][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:30:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:30:03,591][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:30:04,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:30:05,223][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:30:06,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:30:06,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:30:06,382][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:30:07,606][__main__][INFO] - Iteration 686 took 52s (9.47% Gen, 88.18% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 18m 49s. Estimated total time: 14h 27m 32s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 46s. [2026-03-26 00:30:07,609][__main__][INFO] - Starting iteration 686. [2026-03-26 00:30:07,613][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:30:07,614][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:30:12,845][__main__][INFO] - Number of regex retries in iteration 686: 0 [2026-03-26 00:30:12,847][__main__][INFO] - agents played in iteration 686 are Alice, Bob [2026-03-26 00:30:13,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:13,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:13,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:30:13,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:30:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:30:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:30:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:30:16,169][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:30:16,829][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:30:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:30:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:30:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:30:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:30:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:30:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:30:21,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:30:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:30:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:30:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:30:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:30:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:30:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:30:26,072][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:30:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:30:27,392][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:30:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:30:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:30:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:30:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:30:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:30:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:30:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:30:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:30:33,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:30:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:30:34,652][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:30:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:30:35,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:30:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:30:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:30:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:30:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:30:39,275][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:30:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:30:40,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:30:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:30:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:30:42,578][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:30:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:30:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:30:44,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:30:45,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:30:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:30:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:30:47,544][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:30:48,204][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:30:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:30:49,524][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:30:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:30:50,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:30:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:30:52,171][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:30:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:30:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:30:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:30:54,815][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:30:55,475][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:30:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:30:56,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:30:57,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:30:58,745][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:30:58,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:30:58,749][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:31:00,091][__main__][INFO] - Iteration 687 took 52s (9.97% Gen, 87.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 25m 3s. Estimated total time: 14h 34m 39s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 19s. [2026-03-26 00:31:00,094][__main__][INFO] - Starting iteration 687. [2026-03-26 00:31:00,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:31:00,099][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:31:05,039][__main__][INFO] - Number of regex retries in iteration 687: 0 [2026-03-26 00:31:05,041][__main__][INFO] - agents played in iteration 687 are Alice, Bob [2026-03-26 00:31:05,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:31:05,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:31:05,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:31:05,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:31:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:31:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:31:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:31:08,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:31:09,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:31:09,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:31:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:31:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:31:11,792][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:31:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:31:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:31:13,769][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:31:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:31:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:31:15,749][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:31:16,411][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:31:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:31:17,731][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:31:18,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:31:19,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:31:19,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:31:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:31:21,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:31:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:31:22,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:31:23,008][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:31:23,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:31:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:31:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:31:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:31:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:31:26,968][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:31:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:31:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:31:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:31:29,607][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:31:30,266][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:31:30,926][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:31:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:31:32,246][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:31:32,904][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:31:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:31:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:31:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:31:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:31:36,199][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:31:36,861][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:31:37,523][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:31:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:31:39,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:31:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:31:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:31:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:31:41,820][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:31:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:31:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:31:43,799][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:31:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:31:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:31:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:31:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:31:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:31:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:31:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:31:49,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:31:49,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:31:51,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:31:51,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:31:51,103][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:31:52,456][__main__][INFO] - Iteration 688 took 52s (9.44% Gen, 87.97% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 22m 10s. Estimated total time: 14h 32m 39s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 19s. [2026-03-26 00:31:54,104][__main__][INFO] - Starting iteration 688. [2026-03-26 00:31:54,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:31:54,120][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:32:02,338][__main__][INFO] - Number of regex retries in iteration 688: 0 [2026-03-26 00:32:02,339][__main__][INFO] - agents played in iteration 688 are Alice, Bob [2026-03-26 00:32:02,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:02,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:02,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:32:02,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:32:03,687][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:32:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:32:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:32:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:32:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:32:06,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:32:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:32:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:32:08,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:32:09,577][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:32:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:32:10,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:32:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:32:12,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:32:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:32:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:32:14,192][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:32:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:32:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:32:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:32:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:32:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:32:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:32:18,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:32:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:32:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:32:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:32:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:32:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:32:22,766][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:32:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:32:24,084][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:32:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:32:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:32:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:32:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:32:27,377][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:32:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:32:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:32:29,354][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:32:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:32:30,673][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:32:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:32:31,991][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:32:32,650][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:32:33,312][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:32:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:32:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:32:35,636][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:32:36,297][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:32:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:32:37,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:32:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:32:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:32:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:32:40,257][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:32:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:32:41,577][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:32:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:32:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:32:43,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:32:44,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:32:44,874][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:32:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:32:46,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:32:47,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:32:48,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:32:48,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:32:48,212][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:32:49,486][__main__][INFO] - Iteration 689 took 55s (14.84% Gen, 82.83% Train). Generation: 8s, Training: 45s. Estimated remaining time: 5h 11m 33s. Estimated total time: 15h 22m 58s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 29s. [2026-03-26 00:32:49,489][__main__][INFO] - Starting iteration 689. [2026-03-26 00:32:49,494][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:32:49,494][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:32:54,493][__main__][INFO] - Number of regex retries in iteration 689: 0 [2026-03-26 00:32:54,494][__main__][INFO] - agents played in iteration 689 are Alice, Bob [2026-03-26 00:32:55,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:55,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:55,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:32:55,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:32:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:32:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:32:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:32:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:32:58,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:32:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:32:59,686][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:33:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:33:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:33:01,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:33:02,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:33:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:33:03,644][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:33:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:33:04,964][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:33:05,625][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:33:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:33:06,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:33:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:33:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:33:08,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:33:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:33:10,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:33:10,912][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:33:11,570][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:33:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:33:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:33:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:33:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:33:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:33:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:33:16,187][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:33:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:33:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:33:18,166][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:33:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:33:19,484][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:33:20,143][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:33:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:33:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:33:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:33:22,779][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:33:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:33:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:33:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:33:25,418][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:33:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:33:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:33:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:33:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:33:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:33:29,703][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:33:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:33:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:33:31,679][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:33:32,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:33:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:33:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:33:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:33:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:33:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:33:36,297][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:33:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:33:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:33:38,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:33:39,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:33:40,238][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:33:40,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:33:40,242][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:33:41,597][__main__][INFO] - Iteration 690 took 52s (9.60% Gen, 87.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 16m 8s. Estimated total time: 14h 28m 26s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 13s. [2026-03-26 00:33:41,600][__main__][INFO] - Starting iteration 690. [2026-03-26 00:33:41,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:33:41,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:33:46,733][__main__][INFO] - Number of regex retries in iteration 690: 0 [2026-03-26 00:33:46,734][__main__][INFO] - agents played in iteration 690 are Alice, Bob [2026-03-26 00:33:47,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:33:47,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:33:47,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:33:47,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:33:48,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:33:48,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:33:49,481][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:33:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:33:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:33:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:33:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:33:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:33:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:33:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:33:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:33:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:33:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:33:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:33:57,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:33:58,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:33:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:33:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:34:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:34:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:34:01,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:34:01,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:34:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:34:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:34:03,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:34:04,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:34:05,284][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:34:05,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:34:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:34:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:34:07,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:34:08,578][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:34:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:34:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:34:10,555][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:34:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:34:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:34:12,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:34:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:34:13,848][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:34:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:34:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:34:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:34:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:34:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:34:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:34:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:34:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:34:20,023][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:34:20,683][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:34:21,342][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:34:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:34:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:34:23,318][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:34:23,977][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:34:24,637][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:34:25,296][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:34:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:34:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:34:27,273][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:34:27,932][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:34:28,591][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:34:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:34:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:34:30,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:34:31,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:34:32,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:34:32,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:34:32,526][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:34:33,751][__main__][INFO] - Iteration 691 took 52s (9.83% Gen, 87.82% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 15m 55s. Estimated total time: 14h 29m 5s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 32s. [2026-03-26 00:34:33,754][__main__][INFO] - Starting iteration 691. [2026-03-26 00:34:33,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:34:33,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:34:38,619][__main__][INFO] - Number of regex retries in iteration 691: 0 [2026-03-26 00:34:38,621][__main__][INFO] - agents played in iteration 691 are Alice, Bob [2026-03-26 00:34:39,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:34:39,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:34:39,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:34:39,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:34:40,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:34:40,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:34:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:34:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:34:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:34:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:34:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:34:44,629][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:34:45,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:34:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:34:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:34:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:34:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:34:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:34:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:34:49,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:34:50,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:34:51,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:34:51,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:34:52,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:34:53,197][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:34:53,855][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:34:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:34:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:34:55,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:34:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:34:57,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:34:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:34:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:34:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:34:59,794][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:35:00,453][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:35:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:35:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:35:02,434][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:35:03,094][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:35:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:35:04,412][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:35:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:35:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:35:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:35:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:35:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:35:08,369][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:35:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:35:09,688][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:35:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:35:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:35:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:35:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:35:13,324][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:35:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:35:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:35:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:35:15,963][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:35:16,623][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:35:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:35:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:35:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:35:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:35:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:35:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:35:21,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:35:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:35:22,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:35:23,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:35:24,500][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:35:24,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:35:24,504][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:35:25,803][__main__][INFO] - Iteration 692 took 52s (9.34% Gen, 88.16% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 13m 24s. Estimated total time: 14h 27m 26s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-26 00:35:25,806][__main__][INFO] - Starting iteration 692. [2026-03-26 00:35:25,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:35:25,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:35:30,707][__main__][INFO] - Number of regex retries in iteration 692: 0 [2026-03-26 00:35:30,732][__main__][INFO] - agents played in iteration 692 are Alice, Bob [2026-03-26 00:35:31,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:35:31,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:35:31,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:35:31,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:35:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:35:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:35:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:35:33,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:35:34,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:35:35,228][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:35:35,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:35:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:35:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:35:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:35:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:35:39,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:35:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:35:40,497][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:35:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:35:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:35:42,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:35:43,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:35:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:35:44,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:35:45,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:35:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:35:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:35:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:35:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:35:48,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:35:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:35:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:35:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:35:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:35:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:35:52,356][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:35:53,015][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:35:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:35:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:35:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:35:55,651][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:35:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:35:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:35:57,627][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:35:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:35:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:35:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:36:00,263][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:36:00,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:36:01,581][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:36:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:36:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:36:03,814][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:36:04,473][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:36:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:36:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:36:06,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:36:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:36:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:36:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:36:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:36:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:36:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:36:11,065][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:36:11,724][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:36:12,384][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:36:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:36:13,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:36:14,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:36:15,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:36:16,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:36:16,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:36:16,267][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:36:17,563][__main__][INFO] - Iteration 693 took 51s (9.51% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 7m 41s. Estimated total time: 14h 22m 34s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 15s, 500 more iterations: 7h 11m 17s. [2026-03-26 00:36:17,566][__main__][INFO] - Starting iteration 693. [2026-03-26 00:36:17,570][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:36:17,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:36:22,541][__main__][INFO] - Number of regex retries in iteration 693: 0 [2026-03-26 00:36:22,543][__main__][INFO] - agents played in iteration 693 are Alice, Bob [2026-03-26 00:36:23,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:36:23,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:36:23,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:36:23,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:36:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:36:24,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:36:25,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:36:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:36:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:36:27,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:36:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:36:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:36:29,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:36:29,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:36:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:36:31,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:36:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:36:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:36:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:36:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:36:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:36:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:36:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:36:36,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:36:36,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:36:37,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:36:38,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:36:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:36:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:36:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:36:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:36:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:36:42,273][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:36:42,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:36:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:36:44,250][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:36:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:36:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:36:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:36:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:36:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:36:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:36:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:36:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:36:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:36:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:36:51,505][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:36:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:36:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:36:53,483][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:36:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:36:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:36:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:36:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:36:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:36:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:36:58,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:36:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:36:59,650][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:37:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:37:00,969][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:37:01,628][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:37:02,288][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:37:02,948][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:37:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:37:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:37:04,927][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:37:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:37:06,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:37:07,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:37:08,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:37:08,216][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:37:08,217][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:37:09,421][__main__][INFO] - Iteration 694 took 51s (9.59% Gen, 88.09% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 8m 28s. Estimated total time: 14h 24m 13s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 6s. [2026-03-26 00:37:09,423][__main__][INFO] - Starting iteration 694. [2026-03-26 00:37:09,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:37:09,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:37:14,417][__main__][INFO] - Number of regex retries in iteration 694: 0 [2026-03-26 00:37:14,418][__main__][INFO] - agents played in iteration 694 are Alice, Bob [2026-03-26 00:37:14,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:37:15,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:37:15,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:37:15,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:37:15,659][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:37:16,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:37:16,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:37:17,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:37:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:37:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:37:19,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:37:20,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:37:20,897][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:37:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:37:22,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:37:22,874][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:37:23,533][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:37:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:37:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:37:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:37:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:37:26,831][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:37:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:37:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:37:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:37:29,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:37:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:37:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:37:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:37:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:37:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:37:33,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:37:34,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:37:34,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:37:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:37:36,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:37:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:37:37,381][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:37:38,048][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:37:38,708][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:37:39,370][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:37:40,031][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:37:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:37:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:37:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:37:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:37:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:37:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:37:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:37:45,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:37:45,970][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:37:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:37:47,621][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:37:48,282][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:37:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:37:49,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:37:50,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:37:50,926][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:37:51,586][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:37:52,246][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:37:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:37:53,568][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:37:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:37:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:37:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:37:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:37:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:37:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:37:58,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:37:59,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:38:00,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:38:00,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:38:00,281][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:38:01,556][__main__][INFO] - Iteration 695 took 52s (9.57% Gen, 87.98% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 12m 13s. Estimated total time: 14h 28m 50s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 25s. [2026-03-26 00:38:01,559][__main__][INFO] - Starting iteration 695. [2026-03-26 00:38:01,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:38:01,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:38:09,292][__main__][INFO] - Number of regex retries in iteration 695: 0 [2026-03-26 00:38:09,293][__main__][INFO] - agents played in iteration 695 are Alice, Bob [2026-03-26 00:38:09,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:38:09,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:38:09,952][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:38:09,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:38:10,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:38:11,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:38:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:38:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:38:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:38:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:38:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:38:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:38:15,867][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:38:16,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:38:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:38:17,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:38:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:38:19,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:38:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:38:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:38:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:38:21,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:38:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:38:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:38:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:38:24,433][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:38:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:38:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:38:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:38:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:38:27,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:38:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:38:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:38:29,708][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:38:30,368][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:38:31,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:38:31,690][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:38:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:38:33,016][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:38:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:38:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:38:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:38:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:38:36,315][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:38:36,975][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:38:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:38:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:38:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:38:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:38:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:38:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:38:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:38:42,584][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:38:43,244][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:38:43,903][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:38:44,562][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:38:45,221][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:38:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:38:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:38:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:38:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:38:48,518][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:38:49,180][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:38:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:38:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:38:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:38:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:38:52,505][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:38:53,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:38:54,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:38:55,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:38:55,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:38:55,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:38:56,568][__main__][INFO] - Iteration 696 took 55s (14.05% Gen, 83.47% Train). Generation: 7s, Training: 45s. Estimated remaining time: 4h 59m 14s. Estimated total time: 15h 16m 46s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 23s. [2026-03-26 00:38:56,572][__main__][INFO] - Starting iteration 696. [2026-03-26 00:38:56,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:38:56,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:39:01,598][__main__][INFO] - Number of regex retries in iteration 696: 0 [2026-03-26 00:39:01,599][__main__][INFO] - agents played in iteration 696 are Alice, Bob [2026-03-26 00:39:02,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:02,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:02,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:39:02,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:39:03,069][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:39:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:39:04,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:39:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:39:05,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:39:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:39:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:39:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:39:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:39:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:39:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:39:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:39:10,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:39:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:39:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:39:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:39:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:39:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:39:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:39:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:39:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:39:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:39:17,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:39:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:39:18,839][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:39:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:39:20,157][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:39:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:39:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:39:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:39:22,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:39:23,453][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:39:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:39:24,771][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:39:25,430][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:39:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:39:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:39:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:39:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:39:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:39:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:39:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:39:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:39:31,361][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:39:32,021][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:39:32,679][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:39:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:39:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:39:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:39:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:39:36,302][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:39:36,960][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:39:37,620][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:39:38,279][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:39:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:39:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:39:40,257][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:39:40,917][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:39:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:39:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:39:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:39:43,555][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:39:44,214][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:39:44,875][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:39:45,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:39:46,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:39:47,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:39:47,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:39:47,581][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:39:48,822][__main__][INFO] - Iteration 697 took 52s (9.61% Gen, 88.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 12m 23s. Estimated total time: 14h 30m 47s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 23s. [2026-03-26 00:39:48,825][__main__][INFO] - Starting iteration 697. [2026-03-26 00:39:48,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:39:48,830][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:39:53,964][__main__][INFO] - Number of regex retries in iteration 697: 0 [2026-03-26 00:39:53,966][__main__][INFO] - agents played in iteration 697 are Alice, Bob [2026-03-26 00:39:54,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:54,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:54,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:39:54,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:39:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:39:55,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:39:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:39:57,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:39:57,832][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:39:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:39:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:39:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:40:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:40:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:40:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:40:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:40:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:40:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:40:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:40:05,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:40:05,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:40:06,396][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:40:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:40:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:40:08,376][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:40:09,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:40:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:40:10,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:40:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:40:11,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:40:12,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:40:12,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:40:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:40:14,308][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:40:14,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:40:15,626][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:40:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:40:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:40:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:40:18,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:40:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:40:19,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:40:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:40:20,903][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:40:21,563][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:40:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:40:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:40:23,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:40:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:40:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:40:25,522][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:40:26,190][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:40:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:40:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:40:28,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:40:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:40:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:40:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:40:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:40:32,006][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:40:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:40:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:40:33,983][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:40:34,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:40:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:40:35,960][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:40:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:40:37,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:40:37,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:40:38,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:40:39,930][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:40:39,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:40:39,935][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:40:41,253][__main__][INFO] - Iteration 698 took 52s (9.80% Gen, 87.69% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 14m 28s. Estimated total time: 14h 33m 45s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 52s. [2026-03-26 00:40:41,256][__main__][INFO] - Starting iteration 698. [2026-03-26 00:40:41,262][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:40:41,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:40:46,804][__main__][INFO] - Number of regex retries in iteration 698: 0 [2026-03-26 00:40:46,806][__main__][INFO] - agents played in iteration 698 are Alice, Bob [2026-03-26 00:40:47,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:40:47,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:40:47,507][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:40:47,508][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:40:48,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:40:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:40:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:40:50,099][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:40:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:40:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:40:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:40:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:40:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:40:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:40:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:40:55,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:40:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:40:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:40:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:40:57,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:40:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:40:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:40:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:41:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:41:01,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:41:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:41:02,608][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:41:03,266][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:41:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:41:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:41:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:41:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:41:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:41:07,218][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:41:07,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:41:08,535][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:41:09,194][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:41:09,853][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:41:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:41:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:41:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:41:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:41:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:41:13,803][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:41:14,462][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:41:15,120][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:41:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:41:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:41:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:41:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:41:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:41:19,071][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:41:20,061][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:41:20,749][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:41:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:41:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:41:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:41:23,383][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:41:24,042][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:41:24,701][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:41:25,360][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:41:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:41:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:41:27,344][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:41:28,003][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:41:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:41:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:41:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:41:30,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:41:31,555][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:41:32,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:41:32,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:41:32,763][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:41:33,979][__main__][INFO] - Iteration 699 took 52s (10.51% Gen, 87.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 18m 30s. Estimated total time: 14h 38m 40s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 52s, 500 more iterations: 7h 19m 20s. [2026-03-26 00:41:33,983][__main__][INFO] - Starting iteration 699. [2026-03-26 00:41:33,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:41:33,991][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:41:38,912][__main__][INFO] - Number of regex retries in iteration 699: 0 [2026-03-26 00:41:38,914][__main__][INFO] - agents played in iteration 699 are Alice, Bob [2026-03-26 00:41:39,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:41:39,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:41:39,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:41:39,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:41:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:41:41,052][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:41:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:41:42,372][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:41:43,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:41:43,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:41:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:41:45,007][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:41:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:41:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:41:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:41:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:41:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:41:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:41:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:41:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:41:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:41:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:41:52,266][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:41:52,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:41:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:41:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:41:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:41:55,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:41:56,220][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:41:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:41:57,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:41:58,196][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:41:58,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:41:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:42:00,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:42:00,834][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:42:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:42:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:42:02,812][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:42:03,471][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:42:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:42:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:42:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:42:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:42:06,766][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:42:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:42:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:42:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:42:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:42:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:42:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:42:11,381][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:42:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:42:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:42:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:42:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:42:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:42:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:42:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:42:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:42:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:42:18,309][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:42:18,969][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:42:19,628][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:42:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:42:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:42:21,607][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:42:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:42:22,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:42:23,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:42:24,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:42:24,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:42:24,797][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:42:26,124][__main__][INFO] - Iteration 700 took 52s (9.44% Gen, 88.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 7m 54s. Estimated total time: 14h 28m 56s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 28s. [2026-03-26 00:42:26,127][__main__][INFO] - Starting iteration 700. [2026-03-26 00:42:26,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:42:26,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:42:31,952][__main__][INFO] - Number of regex retries in iteration 700: 0 [2026-03-26 00:42:31,954][__main__][INFO] - agents played in iteration 700 are Alice, Bob [2026-03-26 00:42:32,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:42:32,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:42:32,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:42:32,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:42:33,381][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:42:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:42:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:42:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:42:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:42:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:42:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:42:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:42:38,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:42:39,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:42:39,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:42:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:42:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:42:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:42:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:42:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:42:43,953][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:42:44,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:42:45,271][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:42:46,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:42:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:42:47,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:42:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:42:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:42:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:42:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:42:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:42:51,339][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:42:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:42:52,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:42:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:42:53,974][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:42:54,634][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:42:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:42:55,952][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:42:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:42:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:42:57,929][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:42:58,587][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:42:59,246][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:42:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:43:00,564][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:43:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:43:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:43:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:43:03,206][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:43:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:43:04,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:43:05,565][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:43:06,226][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:43:06,886][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:43:07,545][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:43:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:43:08,865][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:43:09,525][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:43:10,185][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:43:10,848][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:43:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:43:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:43:12,828][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:43:13,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:43:14,150][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:43:14,809][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:43:15,469][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:43:16,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:43:16,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:43:18,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:43:18,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:43:18,136][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:43:20,584][__main__][INFO] - Iteration 701 took 54s (10.69% Gen, 84.81% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 45m 39s. Estimated total time: 15h 7m 35s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 45s, 500 more iterations: 7h 33m 47s. [2026-03-26 00:43:20,591][__main__][INFO] - Starting iteration 701. [2026-03-26 00:43:20,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:43:20,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:43:25,889][__main__][INFO] - Number of regex retries in iteration 701: 0 [2026-03-26 00:43:25,891][__main__][INFO] - agents played in iteration 701 are Alice, Bob [2026-03-26 00:43:26,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:43:26,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:43:26,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:43:26,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:43:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:43:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:43:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:43:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:43:29,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:43:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:43:31,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:43:31,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:43:32,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:43:33,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:43:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:43:34,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:43:35,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:43:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:43:36,488][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:43:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:43:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:43:38,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:43:39,125][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:43:39,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:43:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:43:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:43:41,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:43:42,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:43:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:43:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:43:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:43:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:43:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:43:46,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:43:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:43:47,696][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:43:48,355][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:43:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:43:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:43:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:43:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:43:51,652][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:43:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:43:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:43:53,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:43:54,289][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:43:54,948][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:43:55,607][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:43:56,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:43:56,926][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:43:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:43:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:43:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:43:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:44:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:44:01,178][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:44:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:44:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:44:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:44:03,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:44:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:44:05,133][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:44:05,793][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:44:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:44:07,112][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:44:07,772][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:44:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:44:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:44:09,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:44:10,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:44:11,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:44:11,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:44:11,706][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:44:12,889][__main__][INFO] - Iteration 702 took 52s (10.11% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 8m 36s. Estimated total time: 14h 31m 25s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 42s. [2026-03-26 00:44:12,892][__main__][INFO] - Starting iteration 702. [2026-03-26 00:44:12,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:44:12,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:44:21,374][__main__][INFO] - Number of regex retries in iteration 702: 0 [2026-03-26 00:44:21,376][__main__][INFO] - agents played in iteration 702 are Alice, Bob [2026-03-26 00:44:21,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:44:22,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:44:22,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:44:22,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:44:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:44:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:44:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:44:24,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:44:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:44:25,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:44:26,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:44:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:44:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:44:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:44:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:44:29,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:44:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:44:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:44:31,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:44:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:44:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:44:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:44:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:44:35,175][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:44:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:44:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:44:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:44:37,812][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:44:38,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:44:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:44:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:44:40,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:44:41,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:44:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:44:42,427][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:44:43,086][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:44:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:44:44,403][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:44:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:44:45,719][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:44:46,377][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:44:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:44:47,696][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:44:48,355][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:44:49,013][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:44:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:44:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:44:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:44:51,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:44:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:44:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:44:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:44:54,614][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:44:55,272][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:44:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:44:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:44:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:44:57,906][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:44:58,564][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:44:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:44:59,882][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:45:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:45:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:45:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:45:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:45:03,177][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:45:03,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:45:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:45:05,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:45:05,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:45:07,030][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:45:07,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:45:07,034][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:45:08,433][__main__][INFO] - Iteration 703 took 55s (15.26% Gen, 82.21% Train). Generation: 8s, Training: 45s. Estimated remaining time: 5h 1m 53s. Estimated total time: 15h 25m 38s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 49s. [2026-03-26 00:45:08,436][__main__][INFO] - Starting iteration 703. [2026-03-26 00:45:08,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:45:08,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:45:13,499][mllm.models.large_language_model_local][WARNING] - Response >A did not match regex: (|), retry 1/1 [2026-03-26 00:45:17,518][__main__][INFO] - Number of regex retries in iteration 703: 1 [2026-03-26 00:45:17,519][__main__][INFO] - agents played in iteration 703 are Alice, Bob [2026-03-26 00:45:18,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:45:18,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:45:18,407][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:45:18,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-26 00:45:19,053][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:45:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:45:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:45:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:45:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:45:22,293][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:45:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:45:23,609][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:45:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:45:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:45:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-26 00:45:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:45:26,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:45:27,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:45:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:45:28,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:45:29,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:45:30,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:45:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:45:31,513][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:45:32,172][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-26 00:45:32,830][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:45:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:45:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:45:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:45:35,467][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:45:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:45:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:45:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:45:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:45:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:45:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-26 00:45:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:45:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:45:41,394][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:45:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:45:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:45:43,370][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:45:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:45:44,687][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:45:45,345][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 
00:45:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:45:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:45:47,320][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:45:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:45:48,637][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:45:49,296][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:45:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:45:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:45:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:45:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:45:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 00:45:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:45:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:45:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:45:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:45:56,214][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:45:56,873][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:45:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:45:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:45:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:45:59,508][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-26 00:46:00,167][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:46:00,825][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:46:01,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-26 00:46:02,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:46:03,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:46:03,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:46:03,306][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:46:04,607][__main__][INFO] - Iteration 704 took 56s (16.16% Gen, 81.52% Train). Generation: 9s, Training: 45s. Estimated remaining time: 5h 11m 27s. Estimated total time: 15h 36m 8s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 4s. [2026-03-26 00:46:04,610][__main__][INFO] - Starting iteration 704. [2026-03-26 00:46:04,614][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:46:04,615][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:46:09,710][__main__][INFO] - Number of regex retries in iteration 704: 0 [2026-03-26 00:46:09,712][__main__][INFO] - agents played in iteration 704 are Alice, Bob [2026-03-26 00:46:10,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:46:10,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:46:10,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:46:10,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:46:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:46:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:46:12,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:46:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:46:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:46:14,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:46:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:46:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:46:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:46:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:46:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:46:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:46:18,811][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:46:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:46:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:46:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:46:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:46:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:46:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:46:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:46:24,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:46:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:46:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:46:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:46:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:46:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:46:28,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:46:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:46:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:46:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:46:30,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:46:31,336][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:46:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:46:32,653][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:46:33,313][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:46:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:46:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:46:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:46:35,952][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:46:36,611][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:46:37,270][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:46:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:46:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:46:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:46:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:46:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:46:41,227][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:46:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:46:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:46:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:46:44,115][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:46:44,775][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:46:45,435][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:46:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:46:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:46:47,415][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:46:48,075][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:46:48,735][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:46:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:46:50,056][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:46:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:46:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:46:52,035][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:46:52,695][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:46:53,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:46:54,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:46:55,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:46:55,297][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:46:55,298][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:46:56,462][__main__][INFO] - Iteration 705 took 51s (9.83% Gen, 87.92% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 58m 37s. Estimated total time: 14h 24m 9s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2026-03-26 00:46:56,465][__main__][INFO] - Starting iteration 705. [2026-03-26 00:46:56,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:46:56,470][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:47:01,198][__main__][INFO] - Number of regex retries in iteration 705: 0 [2026-03-26 00:47:01,199][__main__][INFO] - agents played in iteration 705 are Alice, Bob [2026-03-26 00:47:01,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:01,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:01,783][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:47:01,784][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:47:02,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:47:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:47:03,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:47:04,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:47:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:47:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:47:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:47:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:47:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:47:08,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:47:08,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:47:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:47:10,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:47:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:47:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:47:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:47:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:47:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:47:14,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:47:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:47:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:47:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:47:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:47:17,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:47:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:47:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:47:19,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:47:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:47:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:47:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:47:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:47:22,842][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:47:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:47:24,161][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:47:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:47:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:47:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:47:26,803][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:47:27,462][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:47:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:47:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:47:29,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:47:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:47:30,762][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:47:31,423][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:47:32,084][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:47:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:47:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:47:34,356][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:47:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:47:35,751][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:47:36,415][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:47:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:47:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:47:38,393][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:47:39,053][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:47:39,712][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:47:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:47:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:47:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:47:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:47:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:47:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:47:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:47:44,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:47:45,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:47:46,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:47:46,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:47:46,825][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:47:48,129][__main__][INFO] - Iteration 706 took 51s (9.15% Gen, 88.31% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 54m 38s. Estimated total time: 14h 21m 2s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 31s. [2026-03-26 00:47:48,132][__main__][INFO] - Starting iteration 706. [2026-03-26 00:47:48,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:47:48,145][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:47:53,311][__main__][INFO] - Number of regex retries in iteration 706: 0 [2026-03-26 00:47:53,312][__main__][INFO] - agents played in iteration 706 are Alice, Bob [2026-03-26 00:47:53,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:54,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:54,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:47:54,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:47:54,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:47:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:47:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:47:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:47:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:47:57,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:47:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:47:59,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:47:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:48:00,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:48:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:48:01,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:48:02,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:48:03,206][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:48:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:48:04,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:48:05,184][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:48:07,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:48:07,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:48:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:48:09,137][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:48:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:48:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:48:11,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:48:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:48:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:48:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:48:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:48:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:48:15,067][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:48:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:48:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:48:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:48:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:48:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:48:19,022][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:48:19,681][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:48:20,340][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:48:20,999][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:48:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:48:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:48:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:48:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:48:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:48:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:48:25,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:48:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:48:27,279][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:48:27,939][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:48:28,599][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:48:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:48:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:48:30,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:48:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:48:31,895][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:48:32,554][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:48:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:48:33,871][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:48:34,529][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:48:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:48:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:48:36,503][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:48:37,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:48:37,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:48:39,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:48:39,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:48:39,138][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:48:40,438][__main__][INFO] - Iteration 707 took 52s (9.88% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 4m 19s. Estimated total time: 14h 31m 35s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 47s. [2026-03-26 00:48:40,441][__main__][INFO] - Starting iteration 707. [2026-03-26 00:48:40,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:48:40,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:48:45,421][__main__][INFO] - Number of regex retries in iteration 707: 0 [2026-03-26 00:48:45,423][__main__][INFO] - agents played in iteration 707 are Alice, Bob [2026-03-26 00:48:45,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:48:46,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:48:46,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:48:46,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:48:46,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:48:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:48:47,958][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:48:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:48:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:48:49,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:48:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:48:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:48:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:48:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:48:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:48:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:48:54,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:48:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:48:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:48:56,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:48:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:48:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:48:59,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:49:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:49:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:03,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:49:04,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:49:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:49:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:49:06,372][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:49:07,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:49:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:49:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:49:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:49:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:49:10,318][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:49:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:49:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:49:12,293][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:49:12,951][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:49:13,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:49:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:49:14,927][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:49:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:49:16,242][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:49:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:49:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:49:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:49:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:49:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:49:20,505][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:49:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:49:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:49:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:49:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:49:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:49:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:49:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:49:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:49:26,427][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:49:27,085][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:49:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:49:28,402][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:49:29,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:49:29,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:49:31,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:49:31,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:49:31,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:49:32,359][__main__][INFO] - Iteration 708 took 51s (9.58% Gen, 87.88% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 57m 7s. Estimated total time: 14h 25m 15s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 37s. [2026-03-26 00:49:32,361][__main__][INFO] - Starting iteration 708. [2026-03-26 00:49:32,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:49:32,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:49:37,362][__main__][INFO] - Number of regex retries in iteration 708: 0 [2026-03-26 00:49:37,363][__main__][INFO] - agents played in iteration 708 are Alice, Bob [2026-03-26 00:49:37,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:49:38,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:49:38,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:49:38,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:49:38,694][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:49:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:49:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:49:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:49:41,278][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:49:41,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:49:42,592][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:49:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:49:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:49:44,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:49:45,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:49:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:49:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:49:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:49:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:49:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:49:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:49:49,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:49:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:49:51,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:49:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:49:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:49:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:49:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:49:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:49:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:49:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:49:59,025][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:49:59,687][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:50:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:50:01,662][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:50:02,320][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:50:02,979][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:50:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:50:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:50:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:50:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:50:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:50:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:50:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:50:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:50:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:50:09,566][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:50:10,538][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:50:11,197][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:50:11,855][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:50:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:50:13,174][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:50:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:50:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:50:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:50:15,811][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:50:16,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:50:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:50:17,788][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:50:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:50:19,106][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:50:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:50:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:50:21,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:50:21,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:50:23,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:50:23,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:50:23,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:50:24,272][__main__][INFO] - Iteration 709 took 51s (9.63% Gen, 88.01% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 56m 8s. Estimated total time: 14h 25m 8s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 34s. [2026-03-26 00:50:24,274][__main__][INFO] - Starting iteration 709. [2026-03-26 00:50:24,278][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:50:24,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:50:29,549][__main__][INFO] - Number of regex retries in iteration 709: 0 [2026-03-26 00:50:29,550][__main__][INFO] - agents played in iteration 709 are Alice, Bob [2026-03-26 00:50:30,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:50:30,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:50:30,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:50:30,219][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:50:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:50:31,467][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:50:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:50:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:50:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:50:34,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:50:34,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:50:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:50:36,067][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:50:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:50:37,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:50:38,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:50:38,695][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:50:39,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:50:40,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:50:40,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:50:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:50:41,981][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:50:42,640][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:50:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:50:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:50:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:50:45,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:50:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:50:46,584][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:50:47,242][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:50:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:50:48,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:50:49,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:50:49,873][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:50:50,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:50:51,189][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:50:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:52,504][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:50:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:50:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:50:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:50:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:50:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:50:56,458][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:50:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:50:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:50:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:50:59,094][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:50:59,753][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:51:00,412][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:51:01,071][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:51:01,731][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:51:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:51:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:51:04,003][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:51:04,662][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:51:05,321][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:51:05,979][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:51:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:51:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:51:07,956][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:51:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:51:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:51:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:51:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:51:11,251][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:51:11,910][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:51:12,569][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:51:13,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:51:13,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:51:15,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:51:15,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:51:15,126][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:51:16,320][__main__][INFO] - Iteration 710 took 52s (10.13% Gen, 87.57% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 57m 31s. Estimated total time: 14h 27m 23s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 41s. [2026-03-26 00:51:16,322][__main__][INFO] - Starting iteration 710. [2026-03-26 00:51:16,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:51:16,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:51:23,442][__main__][INFO] - Number of regex retries in iteration 710: 0 [2026-03-26 00:51:23,444][__main__][INFO] - agents played in iteration 710 are Alice, Bob [2026-03-26 00:51:24,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:51:24,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:51:24,138][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:51:24,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:51:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:51:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:51:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:51:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:51:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:51:28,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:51:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:51:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:51:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:51:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:51:31,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:51:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:51:32,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:51:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:51:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:51:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:51:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:51:35,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:51:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:51:37,250][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:51:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:51:38,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:51:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:51:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:51:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:51:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:51:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:51:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:51:43,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:51:43,836][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:51:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:51:45,153][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:51:45,811][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:51:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:51:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:51:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:51:48,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:51:49,103][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:51:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:51:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:51:51,080][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:51:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:51:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:51:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:51:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:51:54,372][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:51:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:51:55,690][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:51:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:51:57,237][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:51:57,895][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:51:58,552][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:51:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:51:59,867][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:52:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:52:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:52:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:52:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:52:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:52:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:52:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:52:05,126][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:52:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:52:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:52:07,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:52:07,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:52:08,937][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:52:08,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:52:08,942][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:52:10,217][__main__][INFO] - Iteration 711 took 53s (13.20% Gen, 84.42% Train). Generation: 7s, Training: 45s. Estimated remaining time: 4h 27m 25s. Estimated total time: 14h 58m 11s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 49s, 500 more iterations: 7h 29m 5s. [2026-03-26 00:52:10,220][__main__][INFO] - Starting iteration 711. [2026-03-26 00:52:10,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:52:10,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:52:19,863][__main__][INFO] - Number of regex retries in iteration 711: 0 [2026-03-26 00:52:19,865][__main__][INFO] - agents played in iteration 711 are Alice, Bob [2026-03-26 00:52:20,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:52:20,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:52:20,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:52:20,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:52:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:52:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:52:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:52:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:52:23,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:52:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:52:25,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:52:25,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:52:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:52:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:52:27,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:52:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:52:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:52:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:52:30,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:52:30,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:52:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:52:32,245][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:52:32,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:52:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:52:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:52:34,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:52:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:52:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:52:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:52:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:52:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:52:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:52:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:52:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:52:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:52:41,446][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:52:42,103][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:52:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:52:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:52:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:52:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:52:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:52:46,048][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:52:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:52:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:52:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:52:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:52:49,338][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:52:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:52:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:52:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:52:51,969][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:52:52,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:52:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:52:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:52:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:52:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:52:56,154][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:52:56,812][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:52:57,469][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:52:58,127][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:52:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:52:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:53:00,101][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:53:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:53:01,417][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:53:02,075][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:53:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:53:03,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:53:04,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:53:05,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:53:05,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:53:05,240][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:53:06,511][__main__][INFO] - Iteration 712 took 56s (17.12% Gen, 80.61% Train). Generation: 9s, Training: 45s. Estimated remaining time: 5h 6m 26s. Estimated total time: 15h 38m 9s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 4s. [2026-03-26 00:53:06,513][__main__][INFO] - Starting iteration 712. [2026-03-26 00:53:06,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:53:06,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:53:11,269][__main__][INFO] - Number of regex retries in iteration 712: 0 [2026-03-26 00:53:11,270][__main__][INFO] - agents played in iteration 712 are Alice, Bob [2026-03-26 00:53:11,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:53:11,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:53:11,906][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:53:11,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:53:12,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:53:13,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:53:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:53:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:53:15,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:53:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:53:16,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:53:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:53:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:53:18,427][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:53:19,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:53:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:53:20,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:53:21,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:53:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:53:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:53:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:53:23,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:53:24,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:53:25,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:53:25,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:53:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:53:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:53:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:53:28,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:53:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:53:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:53:30,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:53:30,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:53:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:53:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:53:32,905][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:53:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:53:34,221][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:53:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:53:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:53:36,197][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:53:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:53:37,517][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:53:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:53:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:53:39,496][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:53:40,155][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:53:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:53:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:53:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:53:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:53:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:53:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:53:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:53:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:53:46,352][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:53:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:53:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:53:48,330][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:53:48,990][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:53:49,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:53:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:53:50,968][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:53:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:53:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:53:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:53:53,606][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:53:54,265][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:53:54,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:53:55,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:53:56,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:53:56,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:53:56,754][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:53:57,954][__main__][INFO] - Iteration 713 took 51s (9.24% Gen, 88.42% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 44m 44s. Estimated total time: 14h 17m 18s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 43s, 500 more iterations: 7h 8m 39s. [2026-03-26 00:53:57,956][__main__][INFO] - Starting iteration 713. [2026-03-26 00:53:57,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:53:57,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:54:02,835][__main__][INFO] - Number of regex retries in iteration 713: 0 [2026-03-26 00:54:02,836][__main__][INFO] - agents played in iteration 713 are Alice, Bob [2026-03-26 00:54:03,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:03,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:03,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:54:03,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:54:04,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:54:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:54:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:54:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:54:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:54:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:54:07,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:54:08,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:54:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:54:09,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:54:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:54:11,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:54:11,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:54:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:54:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:54:13,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:54:14,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:54:15,239][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:54:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:54:16,557][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:54:17,215][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:54:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:54:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:54:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:54:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:54:20,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:54:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:54:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:54:22,480][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:54:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:54:23,795][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:54:24,454][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:54:25,113][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:54:25,773][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:54:26,432][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:54:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:54:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:54:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:54:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:54:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:54:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:54:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:54:31,712][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:54:32,372][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:54:33,031][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:54:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:54:34,351][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:54:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:54:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:54:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:54:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:54:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:54:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:54:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:54:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:54:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:54:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:54:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:54:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:54:43,215][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:54:43,876][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:54:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:54:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:54:45,855][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:54:46,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:54:47,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:54:48,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:54:48,374][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:54:48,375][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:54:49,588][__main__][INFO] - Iteration 714 took 51s (9.44% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 47m 4s. Estimated total time: 14h 20m 30s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 3s, 500 more iterations: 7h 10m 15s. [2026-03-26 00:54:49,591][__main__][INFO] - Starting iteration 714. [2026-03-26 00:54:49,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:54:49,596][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:54:54,543][__main__][INFO] - Number of regex retries in iteration 714: 0 [2026-03-26 00:54:54,544][__main__][INFO] - agents played in iteration 714 are Alice, Bob [2026-03-26 00:54:55,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:55,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:55,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:54:55,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:54:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:54:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:54:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:54:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:54:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:54:59,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:54:59,662][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:55:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:55:00,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:55:01,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:55:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:55:02,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:55:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:55:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:55:04,932][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:55:05,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:55:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:55:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:55:07,568][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:55:08,227][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:55:08,887][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:55:09,545][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:55:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:55:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:55:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:55:12,182][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:55:12,841][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:55:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:55:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:55:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:55:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:55:16,138][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:55:16,797][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:55:17,456][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:55:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:55:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:55:19,434][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:55:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:55:20,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:55:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:55:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:55:22,730][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:55:23,389][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:55:24,048][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:55:24,707][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:55:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:55:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:55:26,687][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:55:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:55:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:55:28,912][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:55:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:55:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:55:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:55:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:55:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:55:32,867][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:55:33,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:55:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:55:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:55:35,503][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:55:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:55:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:55:37,481][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:55:38,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:55:38,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:55:39,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:55:39,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:55:39,966][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:55:41,243][__main__][INFO] - Iteration 715 took 51s (9.58% Gen, 87.94% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 46m 33s. Estimated total time: 14h 20m 50s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 5s, 500 more iterations: 7h 10m 25s. [2026-03-26 00:55:41,245][__main__][INFO] - Starting iteration 715. [2026-03-26 00:55:41,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:55:41,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:55:46,929][__main__][INFO] - Number of regex retries in iteration 715: 0 [2026-03-26 00:55:46,930][__main__][INFO] - agents played in iteration 715 are Alice, Bob [2026-03-26 00:55:47,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:47,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:47,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:55:47,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:55:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:55:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:55:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:55:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:55:50,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:55:51,503][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:55:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:55:52,819][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:55:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:55:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:55:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:55:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:55:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:55:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:55:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:55:58,089][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:55:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:55:59,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:56:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:56:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:56:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:56:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:56:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:56:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:56:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:56:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:56:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:56:05,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:56:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:56:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:56:07,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:56:08,637][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:56:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:56:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:56:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:56:11,273][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:56:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:56:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:56:13,252][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:56:13,911][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:56:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:56:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:56:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:56:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:56:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:56:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:56:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:56:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:56:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:56:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:56:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:56:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:56:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:56:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:56:24,048][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:56:24,707][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:56:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:56:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:56:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:56:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:56:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:56:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:56:29,320][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:56:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:56:30,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:56:31,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:56:32,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:56:32,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:56:32,626][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:56:33,957][__main__][INFO] - Iteration 716 took 52s (10.78% Gen, 86.69% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 3m 20s. Estimated total time: 14h 38m 29s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 14s. [2026-03-26 00:56:33,963][__main__][INFO] - Starting iteration 716. [2026-03-26 00:56:33,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:56:33,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:56:39,049][__main__][INFO] - Number of regex retries in iteration 716: 0 [2026-03-26 00:56:39,050][__main__][INFO] - agents played in iteration 716 are Alice, Bob [2026-03-26 00:56:39,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:56:39,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:56:39,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:56:39,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:56:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:56:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:56:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:56:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:56:42,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:56:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:56:44,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:56:44,819][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:56:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:56:46,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:56:46,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:56:47,454][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:56:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:56:48,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:56:49,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:56:50,089][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:56:50,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:56:51,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:56:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:56:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:56:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:56:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:56:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:56:55,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:56:56,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:56:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:56:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:56:57,997][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:56:58,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:56:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:56:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:57:00,633][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:57:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:57:01,951][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:57:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:57:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:57:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:57:04,587][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:57:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:57:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:57:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:57:07,224][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:57:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:57:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:57:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:57:09,861][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:57:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:57:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:57:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:57:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:57:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:57:14,072][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:57:14,733][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:57:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:57:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:57:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:57:17,372][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:57:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:57:18,693][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:57:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:57:20,013][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:57:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:57:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:57:21,994][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:57:22,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:57:23,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:57:24,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:57:24,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:57:24,614][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:57:25,903][__main__][INFO] - Iteration 717 took 51s (9.79% Gen, 87.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 49m 36s. Estimated total time: 14h 25m 38s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 49s. [2026-03-26 00:57:25,906][__main__][INFO] - Starting iteration 717. [2026-03-26 00:57:25,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:57:25,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:57:30,958][__main__][INFO] - Number of regex retries in iteration 717: 0 [2026-03-26 00:57:30,959][__main__][INFO] - agents played in iteration 717 are Alice, Bob [2026-03-26 00:57:31,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:57:31,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:57:31,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:57:31,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:57:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:57:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:57:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:57:34,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:57:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:57:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:57:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:57:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:57:37,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:57:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:57:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:57:39,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:57:40,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:57:40,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:57:41,520][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:57:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:57:42,838][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:57:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:57:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:57:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:57:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:57:46,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:57:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:57:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:57:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:57:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:57:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:57:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:57:50,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:57:51,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:57:52,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:57:52,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:57:53,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:57:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:57:54,699][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:57:55,358][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:57:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:57:56,676][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:57:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:57:57,994][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:57:58,653][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:57:59,312][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:57:59,971][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:58:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:58:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:58:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:58:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:58:03,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:58:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:58:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:58:05,571][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:58:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:58:06,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:58:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:58:08,206][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:58:08,865][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:58:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:58:10,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:58:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:58:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:58:12,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:58:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:58:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:58:14,137][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:58:14,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:58:15,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:58:16,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:58:16,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:58:16,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:58:18,006][__main__][INFO] - Iteration 718 took 52s (9.69% Gen, 87.70% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 51m 23s. Estimated total time: 14h 28m 17s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 8s. [2026-03-26 00:58:18,009][__main__][INFO] - Starting iteration 718. [2026-03-26 00:58:18,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:58:18,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:58:23,147][__main__][INFO] - Number of regex retries in iteration 718: 0 [2026-03-26 00:58:23,148][__main__][INFO] - agents played in iteration 718 are Alice, Bob [2026-03-26 00:58:23,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:58:23,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:58:23,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:58:23,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:58:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:58:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:58:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:58:26,497][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:58:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:58:27,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:58:28,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:58:29,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:58:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:58:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:58:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:58:31,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:58:32,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:58:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:58:33,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:58:34,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:58:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:58:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:58:36,379][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:58:37,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:58:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:58:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:58:39,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:58:39,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:58:40,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:58:40,995][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:58:41,654][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:58:42,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:58:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:58:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:58:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:58:44,949][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:58:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:58:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:58:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:58:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:58:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:58:48,904][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:58:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:58:50,224][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:58:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:58:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:58:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:58:52,861][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:58:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:58:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:58:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:58:55,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:58:56,492][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:58:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:58:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:58:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:58:59,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:58:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:59:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:59:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:59:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:59:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:59:03,084][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:59:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:59:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:59:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:59:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:59:06,380][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:59:07,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:59:07,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:59:08,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:59:08,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:59:08,971][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:59:10,270][__main__][INFO] - Iteration 719 took 52s (9.82% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 53m 12s. Estimated total time: 14h 30m 58s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 29s. [2026-03-26 00:59:10,273][__main__][INFO] - Starting iteration 719. [2026-03-26 00:59:10,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:59:10,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:59:15,142][__main__][INFO] - Number of regex retries in iteration 719: 0 [2026-03-26 00:59:15,143][__main__][INFO] - agents played in iteration 719 are Alice, Bob [2026-03-26 00:59:15,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:59:15,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:59:15,799][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:59:15,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:59:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:59:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:59:17,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:59:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:59:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:59:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:59:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:59:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:59:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:59:22,308][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:59:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:59:23,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:59:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:59:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:59:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:59:26,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:59:26,919][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:59:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:59:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:59:28,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:59:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:59:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:59:30,872][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:59:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:59:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:59:32,848][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:59:33,507][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:59:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:59:34,825][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:59:35,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:59:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:59:36,804][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:59:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:59:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:59:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:59:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:59:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:59:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:59:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:59:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:59:42,735][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:59:43,395][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:59:44,053][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:59:44,712][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:59:45,371][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:59:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:59:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:59:47,348][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:59:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:59:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:59:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:59:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:59:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:59:51,615][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:59:52,274][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:59:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:59:53,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:59:54,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:59:54,910][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:59:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:59:56,229][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:59:56,888][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:59:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:59:58,207][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:59:58,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:59:59,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:00:00,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:00:00,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:00:00,787][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:00:02,010][__main__][INFO] - Iteration 720 took 51s (9.40% Gen, 88.23% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 43m 37s. Estimated total time: 14h 22m 15s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 13s, 500 more iterations: 7h 11m 7s. [2026-03-26 01:00:02,013][__main__][INFO] - Starting iteration 720. [2026-03-26 01:00:02,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:00:02,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:00:07,210][__main__][INFO] - Number of regex retries in iteration 720: 0 [2026-03-26 01:00:07,211][__main__][INFO] - agents played in iteration 720 are Alice, Bob [2026-03-26 01:00:07,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:07,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:07,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:00:07,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:00:08,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:00:09,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:00:09,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:00:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:00:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:00:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:00:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:00:13,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:00:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:00:14,456][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:00:15,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:00:15,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:00:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:00:17,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:00:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:00:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:00:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:00:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:00:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:00:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:00:21,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:00:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:00:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:00:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:00:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:00:25,006][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:00:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:00:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:00:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:00:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:00:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:00:28,963][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:00:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:00:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:00:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:00:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:00:32,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:00:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:00:33,584][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:00:34,244][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:00:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:00:35,564][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:00:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:00:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:00:37,549][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:00:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:00:38,869][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:00:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:00:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:00:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:00:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:00:42,454][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:00:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:00:43,774][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:00:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:00:45,094][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:00:45,754][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:00:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:00:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:00:47,733][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:00:48,393][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:00:49,053][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:00:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:00:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:00:51,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:00:51,723][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:00:52,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:00:52,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:00:52,850][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:00:54,268][__main__][INFO] - Iteration 721 took 52s (9.94% Gen, 87.34% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 51m 22s. Estimated total time: 14h 30m 52s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 26s. [2026-03-26 01:00:54,271][__main__][INFO] - Starting iteration 721. [2026-03-26 01:00:54,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:00:54,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:00:59,276][__main__][INFO] - Number of regex retries in iteration 721: 0 [2026-03-26 01:00:59,278][__main__][INFO] - agents played in iteration 721 are Alice, Bob [2026-03-26 01:00:59,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:59,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:59,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:00:59,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:01:00,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:01:01,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:01:01,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:01:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:01:03,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:01:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:01:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:01:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:01:05,834][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:01:06,493][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:01:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:01:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:01:08,470][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:01:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:01:09,789][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:01:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:01:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:01:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:01:12,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:01:13,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:01:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:01:14,409][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:01:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:01:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:01:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:01:17,049][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:01:17,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:01:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:01:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:01:19,690][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:01:20,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:01:21,010][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:01:21,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:01:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:01:22,989][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:01:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:01:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:01:24,969][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:01:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:01:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:01:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:01:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:01:28,268][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:01:28,929][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:01:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:01:30,248][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:01:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:01:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:01:32,465][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:01:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:01:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:01:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:01:35,105][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:01:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:01:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:01:37,085][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:01:37,745][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:01:38,405][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:01:39,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:01:39,725][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:01:40,385][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:01:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:01:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:01:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:01:43,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:01:43,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:01:44,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:01:44,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:01:44,969][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:01:46,269][__main__][INFO] - Iteration 722 took 51s (9.62% Gen, 87.87% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 46m 13s. Estimated total time: 14h 26m 36s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 18s. [2026-03-26 01:01:46,272][__main__][INFO] - Starting iteration 722. [2026-03-26 01:01:46,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:01:46,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:01:51,185][__main__][INFO] - Number of regex retries in iteration 722: 0 [2026-03-26 01:01:51,186][__main__][INFO] - agents played in iteration 722 are Alice, Bob [2026-03-26 01:01:51,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:01:51,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:01:51,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:01:51,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:01:52,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:01:53,139][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:01:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:01:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:01:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:01:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:01:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:01:57,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:01:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:01:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:01:59,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:01:59,725][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:02:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:02:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:02:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:02:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:02:03,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:02:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:02:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:02:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:02:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:02:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:02:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:02:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:02:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:02:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:02:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:02:10,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:02:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:02:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:02:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:02:12,899][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:02:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:02:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:02:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:02:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:02:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:02:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:02:17,512][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:02:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:02:18,830][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:02:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:02:20,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:02:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:02:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:02:22,123][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:02:22,782][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:02:23,441][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:02:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:02:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:02:25,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:02:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:02:27,026][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:02:27,685][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:02:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:02:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:02:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:02:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:02:30,980][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:02:31,640][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:02:32,299][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:02:32,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:02:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:02:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:02:34,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:02:35,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:02:36,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:02:36,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:02:36,946][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:02:38,322][__main__][INFO] - Iteration 723 took 52s (9.42% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 46m 13s. Estimated total time: 14h 27m 27s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-26 01:02:38,325][__main__][INFO] - Starting iteration 723. [2026-03-26 01:02:38,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:02:38,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:02:43,263][__main__][INFO] - Number of regex retries in iteration 723: 0 [2026-03-26 01:02:43,264][__main__][INFO] - agents played in iteration 723 are Alice, Bob [2026-03-26 01:02:43,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:43,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:43,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:02:43,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:02:44,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:02:45,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:02:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:02:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:02:47,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:02:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:02:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:02:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:02:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:02:50,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:02:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:02:51,721][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:02:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:02:53,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:02:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:02:54,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:02:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:02:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:02:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:02:56,991][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:02:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:02:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:02:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:02:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:03:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:03:00,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:03:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:03:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:03:02,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:03:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:03:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:03:04,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:03:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:03:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:03:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:03:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:03:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:03:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:03:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:03:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:03:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:03:11,485][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:03:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:03:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:03:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:03:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:03:14,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:03:15,437][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:03:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:03:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:03:17,646][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:03:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:03:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:03:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:03:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:03:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:03:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:03:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:03:22,917][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:03:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:03:24,235][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:03:24,894][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:03:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:03:26,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:03:26,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:03:27,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:03:28,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:03:28,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:03:28,919][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:03:30,087][__main__][INFO] - Iteration 724 took 51s (9.53% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 40m 34s. Estimated total time: 14h 22m 40s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 20s. [2026-03-26 01:03:30,090][__main__][INFO] - Starting iteration 724. [2026-03-26 01:03:30,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:03:30,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:03:35,314][__main__][INFO] - Number of regex retries in iteration 724: 0 [2026-03-26 01:03:35,314][__main__][INFO] - agents played in iteration 724 are Alice, Bob [2026-03-26 01:03:35,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:03:35,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:03:35,982][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:03:35,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:03:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:03:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:03:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:03:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:03:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:03:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:03:40,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:03:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:03:41,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:03:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:03:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:03:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:03:44,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:03:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:03:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:03:46,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:03:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:03:47,790][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:03:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:03:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:03:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:03:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:03:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:03:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:03:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:03:53,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:03:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:03:54,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:03:55,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:03:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:03:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:03:57,001][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:03:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:03:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:03:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:03:59,633][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:04:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:04:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:04:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:04:02,265][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:04:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:04:03,581][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:04:04,239][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:04:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:04:05,556][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:04:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:04:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:04:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:04:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:04:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:04:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:04:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:04:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:04:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:04:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:04:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:04:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:04:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:04:15,058][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:04:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:04:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:04:17,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:04:17,692][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:04:18,351][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:04:19,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:04:19,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:04:20,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:04:20,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:04:20,795][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:04:22,058][__main__][INFO] - Iteration 725 took 51s (10.04% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 43m 8s. Estimated total time: 14h 26m 6s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 3s. [2026-03-26 01:04:22,061][__main__][INFO] - Starting iteration 725. [2026-03-26 01:04:22,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:04:22,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:04:27,173][__main__][INFO] - Number of regex retries in iteration 725: 0 [2026-03-26 01:04:27,174][__main__][INFO] - agents played in iteration 725 are Alice, Bob [2026-03-26 01:04:27,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:04:27,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:04:27,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:04:27,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:04:28,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:04:29,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:04:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:04:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:04:31,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:04:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:04:32,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:04:33,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:04:33,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:04:34,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:04:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:04:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:04:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:04:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:04:37,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:04:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:04:38,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:04:39,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:04:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:04:40,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:04:41,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:04:42,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:04:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:04:43,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:04:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:04:44,855][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:04:45,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:04:46,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:04:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:04:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:04:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:04:48,800][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:04:49,458][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:04:50,116][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:04:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:04:51,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:04:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:04:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:04:53,405][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:04:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:04:54,721][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:04:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:04:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:04:56,695][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:04:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:04:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:04:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:04:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:05:00,223][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:05:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:05:01,539][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:05:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:05:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:05:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:05:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:05:04,831][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:05:05,488][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:05:06,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:05:06,804][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:05:07,462][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:05:08,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:05:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:05:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:05:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:05:10,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:05:11,478][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:05:12,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:05:12,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:05:12,597][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:05:14,016][__main__][INFO] - Iteration 726 took 51s (9.83% Gen, 87.43% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 42m 3s. Estimated total time: 14h 25m 52s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 56s. [2026-03-26 01:05:14,018][__main__][INFO] - Starting iteration 726. [2026-03-26 01:05:14,022][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:05:14,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:05:19,031][__main__][INFO] - Number of regex retries in iteration 726: 0 [2026-03-26 01:05:19,032][__main__][INFO] - agents played in iteration 726 are Alice, Bob [2026-03-26 01:05:19,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:05:19,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:05:19,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:05:19,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:05:20,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:05:21,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:05:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:05:22,348][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:05:23,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:05:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:05:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:05:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:05:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:05:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:05:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:05:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:05:28,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:05:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:05:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:05:30,250][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:05:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:05:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:05:32,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:05:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:05:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:05:34,203][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:05:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:05:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:05:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:05:36,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:05:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:05:38,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:05:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:05:39,476][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:05:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:05:40,795][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:05:41,453][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:05:42,112][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:05:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:05:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:05:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:05:44,749][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:05:45,408][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:05:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:05:46,726][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:05:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:05:48,045][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:05:48,704][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:05:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:05:50,022][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:05:50,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:05:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:05:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:05:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:05:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:05:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:05:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:05:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:05:56,219][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:05:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:05:57,537][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:05:58,197][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:05:58,856][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:05:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:06:00,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:06:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:06:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:06:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:06:02,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:06:03,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:06:04,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:06:04,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:06:04,643][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:06:05,864][__main__][INFO] - Iteration 727 took 51s (9.66% Gen, 87.98% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 39m 21s. Estimated total time: 14h 24m 3s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 1s. [2026-03-26 01:06:05,866][__main__][INFO] - Starting iteration 727. [2026-03-26 01:06:05,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:06:05,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:06:10,820][__main__][INFO] - Number of regex retries in iteration 727: 0 [2026-03-26 01:06:10,822][__main__][INFO] - agents played in iteration 727 are Alice, Bob [2026-03-26 01:06:11,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:06:11,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:06:11,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:06:11,555][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:06:12,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:06:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:06:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:06:14,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:06:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:06:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:06:16,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:06:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:06:17,403][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:06:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:06:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:06:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:06:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:06:20,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:06:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:06:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:06:22,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:06:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:06:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:06:24,637][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:06:25,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:06:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:06:26,611][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:06:27,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:06:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:06:28,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:06:29,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:06:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:06:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:06:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:06:31,873][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:06:32,531][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:06:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:06:33,847][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:06:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:06:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:06:35,821][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:06:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:06:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:06:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:06:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:06:39,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:06:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:06:40,429][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:06:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:06:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:06:42,405][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:06:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:06:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:06:44,656][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:06:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:06:45,972][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:06:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:06:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:06:47,946][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:06:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:06:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:06:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:06:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:06:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:06:51,893][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:06:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:06:53,208][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:06:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:06:54,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:06:55,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:06:56,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:06:56,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:06:56,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:06:57,793][__main__][INFO] - Iteration 728 took 51s (9.53% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 39m 51s. Estimated total time: 14h 25m 24s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 42s. [2026-03-26 01:06:57,796][__main__][INFO] - Starting iteration 728. [2026-03-26 01:06:57,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:06:57,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:07:03,496][__main__][INFO] - Number of regex retries in iteration 728: 0 [2026-03-26 01:07:03,497][__main__][INFO] - agents played in iteration 728 are Alice, Bob [2026-03-26 01:07:04,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:04,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:04,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:07:04,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:07:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:07:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:07:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:07:06,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:07:07,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:07:08,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:07:08,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:07:09,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:07:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:07:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:07:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:07:11,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:07:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:07:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:07:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:07:14,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:07:15,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:07:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:07:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:07:17,223][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:07:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:07:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:07:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:07:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:07:20,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:07:21,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:07:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:07:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:07:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:07:23,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:07:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:07:25,126][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:07:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:07:26,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:07:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:07:27,765][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:07:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:07:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:07:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:07:30,405][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:07:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:07:31,723][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:07:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:07:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:07:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:07:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:07:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:07:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:07:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:07:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:07:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:07:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:07:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:07:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:07:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:07:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:07:41,866][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:07:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:07:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:07:43,845][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:07:44,505][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:07:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:07:45,824][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:07:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:07:47,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:07:47,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:07:48,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:07:48,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:07:48,944][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:07:50,258][__main__][INFO] - Iteration 729 took 52s (10.86% Gen, 86.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 47m 53s. Estimated total time: 14h 34m 19s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 9s. [2026-03-26 01:07:50,260][__main__][INFO] - Starting iteration 729. [2026-03-26 01:07:50,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:07:50,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:07:57,070][__main__][INFO] - Number of regex retries in iteration 729: 0 [2026-03-26 01:07:57,072][__main__][INFO] - agents played in iteration 729 are Alice, Bob [2026-03-26 01:07:57,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:57,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:57,720][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:07:57,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:07:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:07:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:07:59,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:08:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:08:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:08:01,719][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:08:02,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:08:03,036][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:08:03,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:08:04,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:08:05,013][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:08:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:08:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:08:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:08:07,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:08:08,310][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:08:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:08:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:08:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:08:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:08:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:08:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:08:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:08:13,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:08:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:08:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:08:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:08:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:08:16,882][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:08:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:08:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:08:18,861][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:08:19,522][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:08:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:08:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:08:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:08:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:08:22,819][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:08:23,478][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:08:24,137][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:08:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:08:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:08:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:08:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:08:27,436][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:08:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:08:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:08:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:08:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:08:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:08:31,723][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:08:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:08:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:08:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:08:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:08:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:08:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:08:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:08:36,999][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:08:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:08:38,318][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:08:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:08:39,637][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:08:40,297][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:08:40,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:08:41,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:08:42,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:08:43,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:08:43,002][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:08:44,313][__main__][INFO] - Iteration 730 took 54s (12.59% Gen, 84.98% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 13m 30s. Estimated total time: 15h 0m 50s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 5s, 500 more iterations: 7h 30m 25s. [2026-03-26 01:08:44,315][__main__][INFO] - Starting iteration 730. [2026-03-26 01:08:44,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:08:44,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:08:54,822][__main__][INFO] - Number of regex retries in iteration 730: 0 [2026-03-26 01:08:54,824][__main__][INFO] - agents played in iteration 730 are Alice, Bob [2026-03-26 01:08:55,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:08:55,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:08:55,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:08:55,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:08:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:08:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:08:57,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:08:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:08:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:08:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:09:00,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:09:00,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:09:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:09:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:09:02,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:09:03,371][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:09:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:09:04,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:09:05,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:09:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:09:06,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:09:07,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:09:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:09:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:09:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:09:09,956][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:09:10,615][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:09:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:09:11,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:09:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:09:13,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:09:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:09:14,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:09:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:09:15,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:09:16,542][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:09:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:09:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:09:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:09:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:09:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:09:20,493][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:09:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:09:21,811][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:09:22,469][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:09:23,127][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:09:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:09:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:09:25,104][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:09:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:09:26,422][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:09:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:09:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:09:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:09:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:09:30,005][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:09:30,664][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:09:31,322][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:09:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:09:32,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:09:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:09:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:09:34,616][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:09:35,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:09:35,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:09:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:09:37,250][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:09:37,908][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:09:38,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:09:39,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:09:40,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:09:40,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:09:40,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:09:41,777][__main__][INFO] - Iteration 731 took 57s (18.28% Gen, 79.36% Train). Generation: 10s, Training: 45s. Estimated remaining time: 5h 9m 22s. Estimated total time: 15h 57m 39s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 45s, 500 more iterations: 7h 58m 49s. [2026-03-26 01:09:41,780][__main__][INFO] - Starting iteration 731. [2026-03-26 01:09:41,783][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:09:41,784][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:09:46,861][__main__][INFO] - Number of regex retries in iteration 731: 0 [2026-03-26 01:09:46,862][__main__][INFO] - agents played in iteration 731 are Alice, Bob [2026-03-26 01:09:47,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:47,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:47,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:09:47,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:09:48,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:09:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:09:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:09:50,162][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:09:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:09:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:09:52,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:09:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:09:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:09:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:09:54,769][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:09:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:09:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:09:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:09:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:09:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:09:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:09:59,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:10:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:10:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:10:01,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:10:02,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:10:02,682][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:10:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:10:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:10:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:10:05,323][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:10:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:10:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:10:07,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:10:07,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:10:08,619][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:10:09,277][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:10:09,937][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:10:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:10:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:10:11,913][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:10:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:10:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:10:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:10:14,549][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:10:15,209][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:10:15,867][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:10:16,527][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:10:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:10:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:10:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:10:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:10:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:10:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:10:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:10:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:10:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:10:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:10:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:10:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:10:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:10:26,082][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:10:26,742][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:10:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:10:28,060][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:10:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:10:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:10:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:10:30,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:10:31,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:10:32,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:10:32,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:10:32,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:10:33,952][__main__][INFO] - Iteration 732 took 52s (9.73% Gen, 87.79% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 40m 20s. Estimated total time: 14h 29m 30s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 45s. [2026-03-26 01:10:33,956][__main__][INFO] - Starting iteration 732. [2026-03-26 01:10:33,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:10:33,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:10:39,365][__main__][INFO] - Number of regex retries in iteration 732: 0 [2026-03-26 01:10:39,367][__main__][INFO] - agents played in iteration 732 are Alice, Bob [2026-03-26 01:10:39,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:10:40,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:10:40,006][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:10:40,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:10:40,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:10:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:10:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:10:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:10:43,243][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:10:43,901][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:10:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:10:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:10:45,876][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:10:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:10:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:10:47,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:10:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:10:49,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:10:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:10:50,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:10:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:10:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:10:52,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:10:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:10:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:10:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:10:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:10:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:10:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:10:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:10:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:10:58,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:10:59,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:10:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:11:00,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:11:01,025][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:11:01,684][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:11:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:11:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:11:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:11:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:11:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:11:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:11:06,293][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:11:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:11:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:11:08,269][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:11:08,927][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:11:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:11:10,245][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:11:10,905][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:11:11,564][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:11:12,554][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:11:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:11:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:11:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:11:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:11:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:11:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:11:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:11:17,816][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:11:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:11:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:11:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:11:20,446][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:11:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:11:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:11:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:11:23,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:11:23,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:11:24,977][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:11:24,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:11:24,981][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:11:26,273][__main__][INFO] - Iteration 733 took 52s (10.33% Gen, 87.19% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 41m 52s. Estimated total time: 14h 31m 54s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 57s. [2026-03-26 01:11:26,278][__main__][INFO] - Starting iteration 733. [2026-03-26 01:11:26,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:11:26,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:11:31,438][__main__][INFO] - Number of regex retries in iteration 733: 0 [2026-03-26 01:11:31,439][__main__][INFO] - agents played in iteration 733 are Alice, Bob [2026-03-26 01:11:32,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:11:32,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:11:32,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:11:32,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:11:32,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:11:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:11:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:11:34,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:11:35,352][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:11:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:11:36,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:11:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:11:37,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:11:38,637][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:11:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:11:39,952][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:11:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:11:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:11:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:11:42,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:11:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:11:43,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:11:44,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:11:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:11:45,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:11:46,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:11:47,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:11:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:11:48,496][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:11:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:11:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:11:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:11:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:11:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:11:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:11:53,100][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:11:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:11:54,414][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:11:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:11:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:11:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:11:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:11:57,704][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:11:58,361][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:11:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:11:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:12:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:12:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:12:01,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:12:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:12:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:12:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:12:04,598][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:12:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:12:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:12:06,570][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:12:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:12:07,886][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:12:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:12:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:12:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:12:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:12:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:12:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:12:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:12:13,146][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:12:13,804][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:12:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:12:15,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:12:15,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:12:16,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:12:16,985][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:12:16,986][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:12:18,224][__main__][INFO] - Iteration 734 took 51s (9.93% Gen, 87.69% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 34m 49s. Estimated total time: 14h 25m 43s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 51s. [2026-03-26 01:12:18,229][__main__][INFO] - Starting iteration 734. [2026-03-26 01:12:18,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:12:18,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:12:23,603][__main__][INFO] - Number of regex retries in iteration 734: 0 [2026-03-26 01:12:23,605][__main__][INFO] - agents played in iteration 734 are Alice, Bob [2026-03-26 01:12:24,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:12:24,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:12:24,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:12:24,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:12:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:12:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:12:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:12:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:12:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:12:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:12:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:12:29,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:12:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:12:30,831][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:12:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:12:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:12:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:12:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:12:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:12:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:12:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:12:36,100][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:12:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:12:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:12:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:12:38,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:12:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:12:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:12:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:12:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:12:42,026][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:12:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:12:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:12:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:12:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:12:45,320][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:12:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:12:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:12:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:12:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:12:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:12:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:12:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:12:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:12:51,248][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:12:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:12:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:12:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:12:53,883][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:12:54,542][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:12:55,200][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:12:55,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:12:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:12:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:12:58,157][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:12:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:12:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:13:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:13:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:13:01,451][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:13:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:13:02,769][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:13:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:13:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:13:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:13:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:13:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:13:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:13:07,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:13:08,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:13:09,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:13:09,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:13:09,232][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:13:10,513][__main__][INFO] - Iteration 735 took 52s (10.27% Gen, 87.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 39m 34s. Estimated total time: 14h 31m 21s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 40s. [2026-03-26 01:13:10,515][__main__][INFO] - Starting iteration 735. [2026-03-26 01:13:10,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:13:10,520][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:13:15,298][__main__][INFO] - Number of regex retries in iteration 735: 0 [2026-03-26 01:13:15,299][__main__][INFO] - agents played in iteration 735 are Alice, Bob [2026-03-26 01:13:15,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:13:15,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:13:15,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:13:15,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:13:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:13:17,248][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:13:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:13:18,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:13:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:13:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:13:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:13:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:13:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:13:22,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:13:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:13:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:13:24,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:13:25,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:13:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:13:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:13:27,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:13:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:13:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:13:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:13:29,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:13:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:13:31,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:13:31,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:13:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:13:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:13:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:13:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:13:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:13:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:13:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:13:36,993][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:13:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:13:38,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:13:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:13:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:13:40,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:13:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:13:41,601][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:13:42,259][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:13:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:13:43,576][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:13:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:13:44,892][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:13:45,551][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:13:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:13:46,867][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:13:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:13:48,479][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:13:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:13:49,797][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:13:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:13:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:13:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:13:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:13:53,092][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:13:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:13:54,411][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:13:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:13:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:13:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:13:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:13:57,709][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:13:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:13:59,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:13:59,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:14:01,001][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:14:01,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:14:01,009][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:14:02,294][__main__][INFO] - Iteration 736 took 51s (9.23% Gen, 88.28% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 30m 18s. Estimated total time: 14h 22m 56s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 17s, 500 more iterations: 7h 11m 28s. [2026-03-26 01:14:02,296][__main__][INFO] - Starting iteration 736. [2026-03-26 01:14:02,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:14:02,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:14:09,173][__main__][INFO] - Number of regex retries in iteration 736: 0 [2026-03-26 01:14:09,175][__main__][INFO] - agents played in iteration 736 are Alice, Bob [2026-03-26 01:14:09,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:14:09,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:14:09,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:14:09,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:14:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:14:11,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:14:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:14:12,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:14:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:14:13,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:14:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:14:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:14:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:14:16,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:14:16,982][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:14:17,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:14:18,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:14:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:14:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:14:20,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:14:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:14:21,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:14:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:14:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:14:23,573][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:14:24,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:14:24,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:14:25,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:14:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:14:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:14:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:14:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:14:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:14:29,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:14:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:14:30,823][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:14:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:14:32,141][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:14:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:14:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:14:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:14:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:14:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:14:36,097][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:14:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:14:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:14:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:14:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:14:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:14:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:14:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:14:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:14:42,270][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:14:42,931][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:14:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:14:44,249][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:14:44,908][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:14:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:14:46,226][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:14:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:14:47,545][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:14:48,204][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:14:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:14:49,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:14:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:14:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:14:51,500][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:14:52,159][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:14:52,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:14:53,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:14:54,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:14:54,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:14:54,825][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:14:56,112][__main__][INFO] - Iteration 737 took 53s (12.77% Gen, 84.83% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 3m 22s. Estimated total time: 14h 56m 54s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 41s, 500 more iterations: 7h 28m 27s. [2026-03-26 01:14:56,115][__main__][INFO] - Starting iteration 737. [2026-03-26 01:14:56,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:14:56,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:15:00,943][__main__][INFO] - Number of regex retries in iteration 737: 0 [2026-03-26 01:15:00,944][__main__][INFO] - agents played in iteration 737 are Alice, Bob [2026-03-26 01:15:01,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:01,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:01,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:15:01,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:15:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:15:02,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:15:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:15:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:15:04,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:15:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:15:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:15:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:15:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:15:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:15:08,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:15:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:15:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:15:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:15:11,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:15:12,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:15:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:15:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:15:14,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:15:14,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:15:15,440][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:15:16,099][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:15:16,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:15:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:15:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:15:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:15:19,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:15:20,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:15:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:15:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:15:22,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:15:22,695][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:15:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:15:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:15:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:15:25,333][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:15:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:15:26,653][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:15:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:15:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:15:28,632][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:15:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:15:29,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:15:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:15:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:15:31,932][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:15:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:15:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:15:34,201][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:15:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:15:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:15:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:15:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:15:37,497][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:15:38,156][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:15:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:15:39,473][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:15:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:15:40,792][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:15:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:15:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:15:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:15:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:15:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:15:44,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:15:45,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:15:46,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:15:46,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:15:46,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:15:47,879][__main__][INFO] - Iteration 738 took 51s (9.32% Gen, 88.29% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 28m 18s. Estimated total time: 14h 22m 42s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 21s. [2026-03-26 01:15:47,881][__main__][INFO] - Starting iteration 738. [2026-03-26 01:15:47,885][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:15:47,886][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:15:52,710][__main__][INFO] - Number of regex retries in iteration 738: 0 [2026-03-26 01:15:52,713][__main__][INFO] - agents played in iteration 738 are Alice, Bob [2026-03-26 01:15:53,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:53,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:53,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:15:53,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:15:53,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:15:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:15:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:15:55,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:15:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:15:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:15:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:15:58,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:15:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:15:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:16:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:16:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:16:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:16:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:16:03,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:16:03,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:16:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:16:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:16:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:16:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:16:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:16:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:16:08,391][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:16:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:16:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:16:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:16:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:16:11,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:16:12,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:16:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:16:13,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:16:14,310][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:16:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:16:15,626][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:16:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:16:16,941][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:16:17,599][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:16:18,257][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:16:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:16:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:16:20,232][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:16:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:16:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:16:22,205][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:16:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:16:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:16:24,179][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:16:24,838][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:16:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:16:26,381][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:16:27,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:16:27,697][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:16:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:16:29,013][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:16:29,671][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:16:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:16:30,987][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:16:31,644][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:16:32,302][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:16:32,960][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:16:33,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:16:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:16:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:16:35,590][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:16:36,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:16:36,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-26 01:16:38,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-26 01:16:38,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-26 01:16:38,237][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-26 01:16:39,700][__main__][INFO] - Iteration 739 took 51s (9.32% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 28m 21s. Estimated total time: 14h 23m 36s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 48s.
[2026-03-26 01:16:39,703][__main__][INFO] - Starting iteration 739.
[2026-03-26 01:16:39,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2026-03-26 01:16:39,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:16:44,467][__main__][INFO] - Number of regex retries in iteration 739: 0 [2026-03-26 01:16:44,468][__main__][INFO] - agents played in iteration 739 are Alice, Bob [2026-03-26 01:16:45,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:16:45,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:16:45,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:16:45,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:16:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:16:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:16:47,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:16:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:16:48,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:16:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:16:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:16:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:16:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:16:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:16:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:16:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:16:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:16:54,529][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:16:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:16:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:16:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:16:57,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:16:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:16:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:16:59,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:16:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:17:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:17:01,104][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:17:01,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:17:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:17:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:17:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:17:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:17:05,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:17:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:17:06,364][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:17:07,022][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:17:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:17:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:17:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:17:09,652][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:17:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:17:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:17:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:17:12,283][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:17:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:17:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:17:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:17:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:17:15,572][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:17:16,229][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:17:16,887][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:17:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:17:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:17:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:17:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:17:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:17:21,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:17:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:17:22,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:17:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:17:23,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:17:24,405][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:17:25,063][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:17:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:17:26,380][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:17:27,038][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:17:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:17:28,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:17:29,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-26 01:17:30,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-26 01:17:30,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-26 01:17:30,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-26 01:17:31,566][__main__][INFO] - Iteration 740 took 51s (9.18% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 28m 12s. Estimated total time: 14h 24m 19s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 9s.
[2026-03-26 01:17:31,569][__main__][INFO] - Starting iteration 740.
[2026-03-26 01:17:31,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2026-03-26 01:17:31,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:17:43,914][__main__][INFO] - Number of regex retries in iteration 740: 0 [2026-03-26 01:17:43,916][__main__][INFO] - agents played in iteration 740 are Alice, Bob [2026-03-26 01:17:44,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:44,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:44,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:17:44,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:17:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:17:45,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:17:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:17:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:17:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:17:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:17:49,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:17:49,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:17:50,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:17:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:17:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:17:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:17:53,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:17:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:17:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:17:55,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:17:55,716][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:17:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:17:57,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:17:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:17:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:17:59,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:17:59,658][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:18:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:18:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:18:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:18:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:18:02,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:18:03,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:18:04,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:18:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:18:05,572][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:18:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:18:06,887][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:18:07,545][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:18:08,203][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:18:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:18:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:18:10,175][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:18:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:18:11,491][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:18:12,148][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:18:12,805][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:18:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:18:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:18:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:18:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:18:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:18:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:18:17,630][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:18:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:18:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:18:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:18:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:18:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:18:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:18:22,233][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:18:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:18:23,548][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:18:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:18:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:18:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:18:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:18:26,837][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:18:27,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:18:28,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42
[2026-03-26 01:18:29,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-26 01:18:29,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-26 01:18:29,351][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-26 01:18:30,656][__main__][INFO] - Iteration 741 took 59s (20.89% Gen, 76.90% Train). Generation: 12s, Training: 45s. Estimated remaining time: 5h 27m 37s. Estimated total time: 16h 24m 44s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 28s, 500 more iterations: 8h 12m 22s.
[2026-03-26 01:18:30,658][__main__][INFO] - Starting iteration 741.
[2026-03-26 01:18:30,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2026-03-26 01:18:30,663][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:18:35,456][__main__][INFO] - Number of regex retries in iteration 741: 0 [2026-03-26 01:18:35,457][__main__][INFO] - agents played in iteration 741 are Alice, Bob [2026-03-26 01:18:36,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:18:36,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:18:36,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:18:36,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:18:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:18:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:18:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:18:38,847][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:18:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:18:40,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:18:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:18:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:18:42,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:18:42,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:18:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:18:44,102][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:18:44,760][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:18:45,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:18:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:18:46,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:18:47,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:18:48,045][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:18:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:18:49,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:18:50,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:18:50,675][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:18:51,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:18:51,989][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:18:52,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:18:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:18:53,961][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:18:54,618][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:18:55,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:18:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:18:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:18:57,250][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:18:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:18:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:18:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:18:59,880][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:19:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:19:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:19:01,852][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:19:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:19:03,166][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:19:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:19:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:19:05,138][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:19:05,795][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:19:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:19:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:19:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:19:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:19:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:19:10,012][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:19:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:19:11,327][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:19:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:19:12,642][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:19:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:19:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:19:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:19:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:19:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:19:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:19:17,246][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:19:17,904][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:19:18,561][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:19:19,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:19:19,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42
[2026-03-26 01:19:21,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt
[2026-03-26 01:19:21,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt
[2026-03-26 01:19:21,017][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl
[2026-03-26 01:19:22,230][__main__][INFO] - Iteration 742 took 51s (9.30% Gen, 88.35% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 21m 30s. Estimated total time: 14h 19m 28s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 56s, 500 more iterations: 7h 9m 44s.
[2026-03-26 01:19:22,232][__main__][INFO] - Starting iteration 742.
[2026-03-26 01:19:22,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2026-03-26 01:19:22,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:19:27,092][__main__][INFO] - Number of regex retries in iteration 742: 0 [2026-03-26 01:19:27,093][__main__][INFO] - agents played in iteration 742 are Alice, Bob [2026-03-26 01:19:27,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:19:27,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:19:27,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:19:27,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:19:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:19:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:19:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:19:30,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:19:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:19:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:19:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:19:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:19:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:19:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:19:34,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:19:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:19:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:19:36,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:19:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:19:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:19:38,850][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:19:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:19:40,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:19:40,821][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:19:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:19:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:19:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:19:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:19:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:19:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:19:45,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:19:46,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:19:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:19:47,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:19:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:19:48,706][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:19:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:19:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:19:50,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:19:51,335][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:19:51,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:19:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:19:53,307][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:19:53,965][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:19:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:19:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:19:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:19:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:19:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:19:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:19:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:19:59,224][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:20:00,104][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:20:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:20:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:20:02,077][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:20:02,734][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:20:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:20:04,048][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:20:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:20:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:20:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:20:06,677][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:20:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:20:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:20:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:20:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:20:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:20:10,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:20:11,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:20:12,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:20:12,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:20:12,513][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:20:13,798][__main__][INFO] - Iteration 743 took 51s (9.42% Gen, 88.08% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 20m 34s. Estimated total time: 14h 19m 23s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 56s, 500 more iterations: 7h 9m 41s. [2026-03-26 01:20:13,801][__main__][INFO] - Starting iteration 743. [2026-03-26 01:20:13,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:20:13,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:20:20,951][__main__][INFO] - Number of regex retries in iteration 743: 0 [2026-03-26 01:20:20,952][__main__][INFO] - agents played in iteration 743 are Alice, Bob [2026-03-26 01:20:21,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:20:21,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:20:21,627][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:20:21,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:20:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:20:22,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:20:23,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:20:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:20:24,856][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:20:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:20:26,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:20:26,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:20:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:20:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:20:28,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:20:29,455][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:20:30,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:20:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:20:31,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:20:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:20:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:20:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:20:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:20:34,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:20:35,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:20:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:20:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:20:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:20:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:20:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:20:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:20:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:20:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:20:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:20:41,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:20:42,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:20:43,256][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:20:43,913][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:20:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:20:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:20:45,886][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:20:46,544][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:20:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:20:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:20:48,518][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:20:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:20:49,835][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:20:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:20:51,151][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:20:51,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:20:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:20:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:20:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:20:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:20:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:20:56,034][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:20:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:20:57,351][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:20:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:20:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:20:59,328][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:20:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:21:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:21:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:21:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:21:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:21:03,283][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:21:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:21:04,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:21:05,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:21:06,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:21:06,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:21:06,542][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:21:08,072][__main__][INFO] - Iteration 744 took 54s (13.17% Gen, 84.01% Train). Generation: 7s, Training: 45s. Estimated remaining time: 4h 4m 44s. Estimated total time: 15h 4m 28s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 26s, 500 more iterations: 7h 32m 14s. [2026-03-26 01:21:08,074][__main__][INFO] - Starting iteration 744. [2026-03-26 01:21:08,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:21:08,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:21:13,771][__main__][INFO] - Number of regex retries in iteration 744: 0 [2026-03-26 01:21:13,772][__main__][INFO] - agents played in iteration 744 are Alice, Bob [2026-03-26 01:21:14,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:21:14,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:21:14,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:21:14,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:21:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:21:15,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:21:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:21:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:21:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:21:18,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:21:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:21:19,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:21:20,361][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:21:21,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:21:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:21:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:21:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:21:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:21:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:21:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:21:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:21:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:21:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:21:27,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:21:28,254][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:21:28,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:21:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:21:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:21:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:21:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:21:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:21:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:21:33,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:21:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:21:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:21:35,492][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:21:36,149][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:21:36,808][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:21:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:21:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:21:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:21:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:21:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:21:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:21:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:21:42,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:21:42,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:21:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:21:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:21:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:21:45,364][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:21:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:21:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:21:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:21:48,226][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:21:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:21:49,542][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:21:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:21:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:21:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:21:52,174][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:21:52,831][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:21:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:21:54,148][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:21:54,807][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:21:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:21:56,124][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:21:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:21:57,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:21:58,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:21:59,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:21:59,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:21:59,303][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:22:00,502][__main__][INFO] - Iteration 745 took 52s (10.86% Gen, 86.85% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 33m 9s. Estimated total time: 14h 33m 45s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 52s. [2026-03-26 01:22:00,504][__main__][INFO] - Starting iteration 745. [2026-03-26 01:22:00,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:22:00,509][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:22:08,686][__main__][INFO] - Number of regex retries in iteration 745: 0 [2026-03-26 01:22:08,687][__main__][INFO] - agents played in iteration 745 are Alice, Bob [2026-03-26 01:22:09,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:22:09,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:22:09,395][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:22:09,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:22:10,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:22:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:22:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:22:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:22:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:22:13,361][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:22:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:22:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:22:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:22:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:22:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:22:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:22:17,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:22:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:22:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:22:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:22:20,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:22:21,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:22:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:22:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:22:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:22:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:22:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:22:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:22:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:22:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:22:27,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:22:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:22:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:22:29,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:22:29,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:22:30,470][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:22:31,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:22:31,785][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:22:32,442][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:22:33,099][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:22:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:22:34,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:22:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:22:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:22:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:22:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:22:37,700][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:22:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:22:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:22:39,673][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:22:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:22:40,988][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:22:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:22:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:22:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:22:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:22:44,597][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:22:45,254][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:22:45,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:22:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:22:47,225][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:22:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:22:48,540][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:22:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:22:49,854][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:22:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:22:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:22:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:22:52,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:22:53,177][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:22:54,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:22:54,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:22:54,378][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:22:55,558][__main__][INFO] - Iteration 746 took 55s (14.86% Gen, 82.99% Train). Generation: 8s, Training: 45s. Estimated remaining time: 4h 16m 1s. Estimated total time: 15h 17m 32s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 46s. [2026-03-26 01:22:55,561][__main__][INFO] - Starting iteration 746. [2026-03-26 01:22:55,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:22:55,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:23:04,230][__main__][INFO] - Number of regex retries in iteration 746: 0 [2026-03-26 01:23:04,232][__main__][INFO] - agents played in iteration 746 are Alice, Bob [2026-03-26 01:23:04,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:04,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:04,828][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:23:04,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:23:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:23:06,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:23:06,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:23:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:23:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:23:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:23:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:23:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:23:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:23:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:23:11,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:23:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:23:13,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:23:13,961][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:23:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:23:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:23:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:23:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:23:17,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:23:17,905][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:23:18,562][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:23:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:23:19,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:23:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:23:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:23:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:23:22,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:23:23,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:23:23,816][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:23:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:23:25,129][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:23:25,787][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:23:26,444][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:23:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:23:27,758][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:23:28,415][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:23:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:23:29,730][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:23:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:23:31,043][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:23:31,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:23:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:23:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:23:33,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:23:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:23:34,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:23:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:23:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:23:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:23:37,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:23:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:23:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:23:39,893][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:23:40,550][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:23:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:23:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:23:42,520][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:23:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:23:43,835][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:23:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:23:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:23:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:23:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:23:47,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:23:47,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:23:48,450][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:23:49,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:23:49,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:23:49,649][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:23:51,001][__main__][INFO] - Iteration 747 took 55s (15.63% Gen, 81.92% Train). Generation: 8s, Training: 45s. Estimated remaining time: 4h 21m 31s. Estimated total time: 15h 23m 58s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 59s. [2026-03-26 01:23:51,004][__main__][INFO] - Starting iteration 747. [2026-03-26 01:23:51,008][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:23:51,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:23:55,796][__main__][INFO] - Number of regex retries in iteration 747: 0 [2026-03-26 01:23:55,797][__main__][INFO] - agents played in iteration 747 are Alice, Bob [2026-03-26 01:23:56,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:56,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:56,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:23:56,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:23:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:23:57,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:23:58,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:23:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:23:59,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:24:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:24:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:24:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:24:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:24:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:24:03,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:24:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:24:05,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:24:05,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:24:06,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:24:06,997][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:24:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:24:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:24:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:24:09,628][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:24:10,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:24:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:24:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:24:12,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:24:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:24:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:24:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:24:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:24:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:24:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:24:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:24:17,519][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:24:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:24:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:24:19,491][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:24:20,148][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:24:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:24:21,463][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:24:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:24:22,777][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:24:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:24:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:24:24,749][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:24:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:24:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:24:26,722][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:24:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:24:28,038][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:24:28,926][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:24:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:24:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:24:30,897][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:24:31,555][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:24:32,213][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:24:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:24:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:24:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:24:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:24:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:24:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:24:36,814][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:24:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:24:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:24:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:24:39,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:24:40,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:24:41,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:24:41,361][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:24:41,363][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:24:42,564][__main__][INFO] - Iteration 748 took 51s (9.29% Gen, 88.38% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 15m 58s. Estimated total time: 14h 19m 17s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 55s, 500 more iterations: 7h 9m 38s. [2026-03-26 01:24:42,566][__main__][INFO] - Starting iteration 748. [2026-03-26 01:24:42,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:24:42,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:24:47,396][__main__][INFO] - Number of regex retries in iteration 748: 0 [2026-03-26 01:24:47,397][__main__][INFO] - agents played in iteration 748 are Alice, Bob [2026-03-26 01:24:48,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:24:48,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:24:48,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:24:48,080][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:24:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:24:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:24:50,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:24:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:24:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:24:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:24:52,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:24:53,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:24:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:24:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:24:55,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:24:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:24:56,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:24:57,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:24:57,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:24:58,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:24:59,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:24:59,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:25:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:25:01,233][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:25:01,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:25:02,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:25:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:25:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:25:04,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:25:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:25:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:25:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:25:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:25:07,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:25:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:25:09,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:25:09,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:25:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:25:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:25:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:25:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:25:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:25:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:25:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:25:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:25:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:25:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:25:17,034][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:25:17,693][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:25:18,352][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:25:19,010][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:25:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:25:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:25:21,249][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:25:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:25:22,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:25:23,225][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:25:23,883][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:25:24,542][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:25:25,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:25:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:25:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:25:27,179][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:25:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:25:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:25:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:25:29,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:25:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:25:31,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:25:32,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:25:33,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:25:33,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:25:33,119][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:25:34,381][__main__][INFO] - Iteration 749 took 51s (9.31% Gen, 88.25% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 19m 22s. Estimated total time: 14h 23m 32s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 46s. [2026-03-26 01:25:34,383][__main__][INFO] - Starting iteration 749. [2026-03-26 01:25:34,387][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:25:34,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:25:39,438][__main__][INFO] - Number of regex retries in iteration 749: 0 [2026-03-26 01:25:39,439][__main__][INFO] - agents played in iteration 749 are Alice, Bob [2026-03-26 01:25:39,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:40,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:40,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:25:40,051][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:25:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:25:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:25:42,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:25:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:25:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:25:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:25:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:25:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:25:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:25:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:25:47,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:25:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:25:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:25:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:25:49,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:25:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:25:51,212][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:25:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:25:52,528][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:25:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:25:53,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:25:54,502][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:25:55,160][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:25:55,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:25:56,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:25:57,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:25:57,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:25:58,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:25:59,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:25:59,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:26:00,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:26:01,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:26:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:26:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:26:03,064][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:26:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:26:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:26:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:26:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:26:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:26:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:26:07,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:26:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:26:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:26:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:26:10,305][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:26:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:26:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:26:12,586][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:26:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:26:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:26:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:26:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:26:15,879][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:26:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:26:17,196][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:26:17,854][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:26:18,512][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:26:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:26:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:26:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:26:21,146][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:26:21,804][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:26:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:26:23,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:26:23,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:26:24,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:26:24,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:26:24,976][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:26:26,172][__main__][INFO] - Iteration 750 took 51s (9.75% Gen, 87.93% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 18m 4s. Estimated total time: 14h 23m 6s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 33s. [2026-03-26 01:26:26,174][__main__][INFO] - Starting iteration 750. [2026-03-26 01:26:26,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:26:26,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:26:31,058][__main__][INFO] - Number of regex retries in iteration 750: 0 [2026-03-26 01:26:31,059][__main__][INFO] - agents played in iteration 750 are Alice, Bob [2026-03-26 01:26:31,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:26:31,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:26:31,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:26:31,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:26:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:26:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:26:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:26:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:26:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:26:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:26:36,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:26:36,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:26:37,604][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:26:38,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:26:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:26:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:26:40,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:26:40,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:26:41,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:26:42,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:26:42,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:26:43,519][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:26:44,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:26:44,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:26:45,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:26:46,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:26:46,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:26:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:26:48,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:26:48,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:26:49,434][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:26:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:26:50,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:26:51,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:26:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:26:52,720][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:26:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:26:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:26:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:26:55,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:26:56,007][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:26:56,665][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:26:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:26:57,979][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:26:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:26:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:26:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:27:00,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:27:01,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:27:01,923][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:27:02,581][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:27:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:27:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:27:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:27:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:27:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:27:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:27:07,411][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:27:08,068][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:27:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:27:09,382][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:27:10,040][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:27:10,697][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:27:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:27:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:27:12,669][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:27:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:27:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:27:14,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:27:15,364][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:27:16,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:27:16,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:27:16,494][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:27:19,108][__main__][INFO] - Iteration 751 took 52s (9.22% Gen, 85.84% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 36m 17s. Estimated total time: 14h 42m 12s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 13s, 500 more iterations: 7h 21m 6s. [2026-03-26 01:27:19,111][__main__][INFO] - Starting iteration 751. [2026-03-26 01:27:19,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:27:19,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:27:25,005][__main__][INFO] - Number of regex retries in iteration 751: 0 [2026-03-26 01:27:25,006][__main__][INFO] - agents played in iteration 751 are Alice, Bob [2026-03-26 01:27:25,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:27:25,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:27:25,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:27:25,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:27:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:27:26,894][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:27:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:27:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:27:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:27:29,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:27:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:27:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:27:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:27:32,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:27:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:27:33,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:27:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:27:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:27:35,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:27:36,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:27:36,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:27:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:27:38,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:27:38,729][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:27:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:27:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:27:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:27:41,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:27:42,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:27:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:27:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:27:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:27:44,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:27:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:27:45,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:27:46,635][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:27:47,295][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:27:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:27:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:27:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:27:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:27:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:27:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:27:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:27:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:27:53,228][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:27:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:27:54,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:27:55,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:27:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:27:56,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:27:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:27:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:27:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:27:59,471][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:28:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:28:00,788][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:28:01,448][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:28:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:28:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:28:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:28:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:28:04,744][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:28:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:28:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:28:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:28:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:28:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:28:08,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:28:09,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:28:10,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:28:10,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:28:10,562][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:28:11,782][__main__][INFO] - Iteration 752 took 52s (11.18% Gen, 86.49% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 31m 1s. Estimated total time: 14h 37m 49s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 54s. [2026-03-26 01:28:11,785][__main__][INFO] - Starting iteration 752. [2026-03-26 01:28:11,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:28:11,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:28:17,054][__main__][INFO] - Number of regex retries in iteration 752: 0 [2026-03-26 01:28:17,055][__main__][INFO] - agents played in iteration 752 are Alice, Bob [2026-03-26 01:28:17,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:28:17,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:28:17,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:28:17,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:28:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:28:19,116][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:28:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:28:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:28:21,090][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:28:21,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:28:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:28:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:28:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:28:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:28:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:28:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:28:26,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:28:27,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:28:27,671][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:28:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:28:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:28:29,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:28:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:28:30,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:28:31,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:28:32,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:28:32,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:28:33,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:28:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:28:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:28:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:28:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:28:36,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:28:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:28:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:28:38,874][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:28:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:28:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:28:40,852][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:28:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:28:42,170][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:28:42,829][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:28:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:28:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:28:44,807][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:28:45,466][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:28:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:28:46,785][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:28:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:28:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:28:48,765][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:28:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:28:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:28:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:28:51,645][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:28:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:28:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:28:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:28:54,282][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:28:54,941][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:28:55,601][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:28:56,260][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:28:56,920][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:28:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:28:58,238][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:28:58,897][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:28:59,556][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:29:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:29:00,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:29:01,599][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:29:02,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:29:02,683][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:29:02,684][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:29:04,016][__main__][INFO] - Iteration 753 took 52s (10.08% Gen, 87.36% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 22m 49s. Estimated total time: 14h 30m 29s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 14s. [2026-03-26 01:29:04,019][__main__][INFO] - Starting iteration 753. [2026-03-26 01:29:04,024][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:29:04,024][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:29:09,984][__main__][INFO] - Number of regex retries in iteration 753: 0 [2026-03-26 01:29:09,985][__main__][INFO] - agents played in iteration 753 are Alice, Bob [2026-03-26 01:29:10,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:29:10,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:29:10,598][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:29:10,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:29:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:29:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:29:12,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:29:13,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:29:13,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:29:14,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:29:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:29:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:29:16,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:29:17,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:29:17,778][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:29:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:29:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:29:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:29:20,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:29:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:29:21,731][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:29:22,390][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:29:23,049][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:29:23,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:29:24,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:29:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:29:25,683][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:29:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:29:26,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:29:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:29:28,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:29:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:29:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:29:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:29:30,943][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:29:31,601][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:29:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:29:32,916][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:29:33,574][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:29:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:29:34,889][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:29:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:29:36,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:29:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:29:37,520][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:29:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:29:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:29:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:29:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:29:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:29:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:29:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:29:43,030][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:29:43,688][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:29:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:29:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:29:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:29:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:29:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:29:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:29:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:29:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:29:49,606][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:29:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:29:50,921][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:29:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:29:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:29:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:29:53,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:29:54,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:29:55,506][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:29:55,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:29:55,510][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:29:56,676][__main__][INFO] - Iteration 754 took 52s (11.32% Gen, 86.46% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 29m 1s. Estimated total time: 14h 37m 34s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 47s. [2026-03-26 01:29:56,679][__main__][INFO] - Starting iteration 754. [2026-03-26 01:29:56,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:29:56,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:30:01,597][__main__][INFO] - Number of regex retries in iteration 754: 0 [2026-03-26 01:30:01,599][__main__][INFO] - agents played in iteration 754 are Alice, Bob [2026-03-26 01:30:02,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:02,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:02,303][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:30:02,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:30:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:30:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:30:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:30:04,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:30:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:30:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:30:06,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:30:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:30:08,172][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:30:08,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:30:09,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:30:10,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:30:10,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:30:11,458][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:30:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:30:12,771][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:30:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:30:14,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:30:14,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:30:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:30:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:30:16,716][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:30:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:30:18,031][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:30:18,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:30:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:30:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:30:20,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:30:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:30:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:30:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:30:23,287][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:30:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:30:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:30:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:30:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:30:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:30:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:30:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:30:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:30:29,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:30:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:30:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:30:31,174][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:30:31,832][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:30:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:30:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:30:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:30:34,684][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:30:35,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:30:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:30:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:30:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:30:37,971][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:30:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:30:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:30:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:30:40,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:30:41,261][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:30:41,919][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:30:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:30:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:30:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:30:44,549][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:30:45,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:30:45,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:30:47,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:30:47,117][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:30:47,121][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:30:48,370][__main__][INFO] - Iteration 755 took 51s (9.51% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 12m 4s. Estimated total time: 14h 21m 28s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 8s, 500 more iterations: 7h 10m 44s. [2026-03-26 01:30:48,373][__main__][INFO] - Starting iteration 755. [2026-03-26 01:30:48,379][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:30:48,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:30:54,218][__main__][INFO] - Number of regex retries in iteration 755: 0 [2026-03-26 01:30:54,219][__main__][INFO] - agents played in iteration 755 are Alice, Bob [2026-03-26 01:30:54,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:54,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:54,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:30:54,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:30:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:30:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:30:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:30:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:30:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:30:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:30:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:31:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:31:00,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:31:01,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:31:02,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:31:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:31:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:31:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:31:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:31:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:31:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:31:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:31:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:31:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:31:08,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:31:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:31:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:31:10,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:31:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:31:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:31:12,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:31:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:31:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:31:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:31:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:31:15,815][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:31:16,472][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:31:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:31:17,786][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:31:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:31:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:31:19,757][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:31:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:31:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:31:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:31:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:31:23,043][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:31:23,700][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:31:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:31:25,016][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:31:25,675][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:31:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:31:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:31:27,913][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:31:28,571][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:31:29,228][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:31:29,885][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:31:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:31:31,199][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:31:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:31:32,513][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:31:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:31:33,827][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:31:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:31:35,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:31:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:31:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:31:37,113][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:31:37,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:31:38,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:31:39,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:31:39,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:31:39,548][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:31:40,953][__main__][INFO] - Iteration 756 took 52s (11.11% Gen, 86.22% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 25m 59s. Estimated total time: 14h 36m 16s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 8s. [2026-03-26 01:31:40,965][__main__][INFO] - Starting iteration 756. [2026-03-26 01:31:40,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:31:40,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:31:47,708][__main__][INFO] - Number of regex retries in iteration 756: 0 [2026-03-26 01:31:47,710][__main__][INFO] - agents played in iteration 756 are Alice, Bob [2026-03-26 01:31:48,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:31:48,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:31:48,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:31:48,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:31:49,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:31:49,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:31:50,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:31:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:31:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:31:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:31:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:31:53,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:31:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:31:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:31:55,651][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:31:56,307][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:31:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:31:57,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:31:58,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:31:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:31:59,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:32:00,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:32:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:32:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:32:02,220][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:32:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:32:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:32:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:32:04,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:32:05,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:32:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:32:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:32:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:32:08,134][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:32:08,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:32:09,448][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:32:10,106][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:32:10,764][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:32:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:32:12,078][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:32:12,735][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:32:13,392][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:32:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:32:14,707][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:32:15,364][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:32:16,021][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:32:16,679][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:32:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:32:17,993][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:32:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:32:19,308][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:32:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:32:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:32:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:32:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:32:22,810][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:32:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:32:24,125][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:32:24,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:32:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:32:26,098][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:32:26,755][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:32:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:32:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:32:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:32:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:32:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:32:30,699][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:32:31,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:32:32,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:32:33,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:32:33,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:32:33,260][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:32:34,546][__main__][INFO] - Iteration 757 took 53s (12.57% Gen, 85.02% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 41m 43s. Estimated total time: 14h 52m 53s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 17s, 500 more iterations: 7h 26m 26s. [2026-03-26 01:32:34,548][__main__][INFO] - Starting iteration 757. [2026-03-26 01:32:34,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:32:34,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:32:39,419][__main__][INFO] - Number of regex retries in iteration 757: 0 [2026-03-26 01:32:39,421][__main__][INFO] - agents played in iteration 757 are Alice, Bob [2026-03-26 01:32:39,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:32:40,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:32:40,006][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:32:40,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:32:40,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:32:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:32:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:32:42,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:32:43,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:32:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:32:44,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:32:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:32:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:32:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:32:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:32:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:32:48,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:32:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:32:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:32:50,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:32:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:32:51,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:32:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:32:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:32:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:32:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:32:55,049][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:32:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:32:56,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:32:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:32:57,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:32:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:32:58,992][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:32:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:33:00,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:33:00,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:33:01,621][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:33:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:33:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:33:03,592][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:33:04,249][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:33:04,906][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:33:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:33:06,221][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:33:06,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:33:07,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:33:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:33:08,850][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:33:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:33:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:33:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:33:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:33:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:33:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:33:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:33:14,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:33:15,013][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:33:15,671][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:33:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:33:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:33:17,643][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:33:18,301][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:33:18,958][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:33:19,615][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:33:20,272][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:33:20,930][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:33:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:33:22,245][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:33:22,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:33:23,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:33:24,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:33:24,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:33:24,682][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:33:25,954][__main__][INFO] - Iteration 758 took 51s (9.46% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 4m 37s. Estimated total time: 14h 16m 39s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 39s, 500 more iterations: 7h 8m 19s. [2026-03-26 01:33:25,957][__main__][INFO] - Starting iteration 758. [2026-03-26 01:33:25,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:33:25,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:33:30,899][__main__][INFO] - Number of regex retries in iteration 758: 0 [2026-03-26 01:33:30,900][__main__][INFO] - agents played in iteration 758 are Alice, Bob [2026-03-26 01:33:31,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:33:31,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:33:31,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:33:31,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:33:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:33:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:33:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:33:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:33:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:33:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:33:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:33:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:33:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:33:38,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:33:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:33:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:33:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:33:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:33:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:33:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:33:42,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:33:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:33:43,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:33:44,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:33:45,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:33:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:33:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:33:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:33:47,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:33:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:33:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:33:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:33:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:33:51,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:33:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:33:52,512][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:33:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:33:53,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:33:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:33:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:33:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:33:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:33:57,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:33:57,779][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:33:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:33:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:33:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:34:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:34:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:34:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:34:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:34:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:34:03,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:34:04,616][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:34:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:34:05,934][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:34:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:34:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:34:07,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:34:08,571][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:34:09,230][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:34:09,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:34:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:34:11,207][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:34:11,871][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:34:12,527][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:34:13,188][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:34:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:34:14,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:34:15,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:34:16,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:34:16,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:34:16,855][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:34:18,099][__main__][INFO] - Iteration 759 took 52s (9.47% Gen, 88.14% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 16m 5s. Estimated total time: 14h 28m 59s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 29s. [2026-03-26 01:34:18,101][__main__][INFO] - Starting iteration 759. [2026-03-26 01:34:18,107][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:34:18,108][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:34:23,828][__main__][INFO] - Number of regex retries in iteration 759: 0 [2026-03-26 01:34:23,830][__main__][INFO] - agents played in iteration 759 are Alice, Bob [2026-03-26 01:34:24,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:34:24,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:34:24,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:34:24,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:34:25,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:34:25,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:34:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:34:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:34:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:34:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:34:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:34:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:34:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:34:31,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:34:31,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:34:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:34:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:34:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:34:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:34:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:34:35,638][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:34:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:34:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:34:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:34:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:34:38,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:34:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:34:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:34:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:34:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:34:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:34:42,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:34:43,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:34:44,205][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:34:44,865][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:34:45,525][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:34:46,184][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:34:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:34:47,504][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:34:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:34:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:34:49,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:34:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:34:50,799][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:34:51,458][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:34:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:34:52,776][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:34:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:34:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:34:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:34:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:34:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:34:57,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:34:57,717][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:34:58,376][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:34:59,037][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:34:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:35:00,354][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:35:01,013][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:35:01,672][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:35:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:35:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:35:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:35:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:35:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:35:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:35:06,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:35:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:35:07,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:35:08,349][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:35:09,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:35:09,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:35:09,521][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:35:10,832][__main__][INFO] - Iteration 760 took 52s (10.85% Gen, 86.65% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 25m 0s. Estimated total time: 14h 38m 47s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 52s, 500 more iterations: 7h 19m 23s. [2026-03-26 01:35:10,835][__main__][INFO] - Starting iteration 760. [2026-03-26 01:35:10,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:35:10,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:35:16,182][__main__][INFO] - Number of regex retries in iteration 760: 0 [2026-03-26 01:35:16,183][__main__][INFO] - agents played in iteration 760 are Alice, Bob [2026-03-26 01:35:16,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:16,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:16,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:35:16,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:35:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:35:18,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:35:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:35:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:35:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:35:20,778][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:35:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:35:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:35:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:35:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:35:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:35:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:35:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:35:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:35:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:35:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:35:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:35:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:35:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:35:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:35:30,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:35:31,329][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:35:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:35:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:35:33,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:35:33,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:35:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:35:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:35:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:35:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:35:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:35:37,924][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:35:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:35:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:35:39,903][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:35:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:35:41,222][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:35:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:35:42,539][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:35:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:35:43,858][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:35:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:35:45,176][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:35:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:35:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:35:47,154][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:35:47,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:35:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:35:49,458][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:35:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:35:50,776][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:35:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:35:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:35:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:35:53,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:35:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:35:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:35:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:35:56,047][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:35:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:35:57,365][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:35:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:35:58,683][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:35:59,341][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:36:00,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:36:00,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:36:01,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:36:01,962][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:36:01,963][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:36:03,202][__main__][INFO] - Iteration 761 took 52s (10.20% Gen, 87.43% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 18m 5s. Estimated total time: 14h 32m 44s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 22s. [2026-03-26 01:36:03,205][__main__][INFO] - Starting iteration 761. [2026-03-26 01:36:03,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:36:03,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:36:08,094][__main__][INFO] - Number of regex retries in iteration 761: 0 [2026-03-26 01:36:08,095][__main__][INFO] - agents played in iteration 761 are Alice, Bob [2026-03-26 01:36:08,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:36:08,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:36:08,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:36:08,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:36:09,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:36:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:36:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:36:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:36:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:36:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:36:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:36:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:36:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:36:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:36:15,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:36:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:36:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:36:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:36:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:36:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:36:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:36:20,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:36:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:36:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:36:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:36:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:36:23,777][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:36:24,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:36:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:36:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:36:26,409][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:36:27,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:36:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:36:28,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:36:29,042][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:36:29,700][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:36:30,359][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:36:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:36:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:36:32,335][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:36:32,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:36:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:36:34,309][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:36:34,968][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:36:35,626][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:36:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:36:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:36:37,600][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:36:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:36:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:36:39,576][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:36:40,234][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:36:41,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:36:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:36:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:36:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:36:43,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:36:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:36:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:36:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:36:46,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:36:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:36:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:36:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:36:49,071][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:36:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:36:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:36:51,048][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:36:51,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:36:52,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:36:53,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:36:53,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:36:53,517][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:36:54,730][__main__][INFO] - Iteration 762 took 51s (9.48% Gen, 88.16% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 3m 12s. Estimated total time: 14h 18m 42s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 52s, 500 more iterations: 7h 9m 21s. [2026-03-26 01:36:54,733][__main__][INFO] - Starting iteration 762. [2026-03-26 01:36:54,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:36:54,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:36:59,864][__main__][INFO] - Number of regex retries in iteration 762: 0 [2026-03-26 01:36:59,866][__main__][INFO] - agents played in iteration 762 are Alice, Bob [2026-03-26 01:37:00,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:00,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:00,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:37:00,525][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:37:01,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:37:01,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:37:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:37:03,133][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:37:03,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:37:04,449][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:37:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:37:05,764][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:37:06,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:37:07,079][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:37:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:37:08,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:37:09,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:37:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:37:10,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:37:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:37:11,685][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:37:12,343][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:37:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:37:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:37:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:37:14,976][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:37:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:37:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:37:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:37:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:37:18,267][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:37:18,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:37:19,584][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:37:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:37:20,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:37:21,558][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:37:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:37:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:37:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:37:24,191][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:37:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:37:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:37:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:37:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:37:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:37:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:37:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:37:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:37:30,117][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:37:30,775][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:37:31,433][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:37:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:37:33,048][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:37:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:37:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:37:35,023][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:37:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:37:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:37:36,997][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:37:37,656][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:37:38,314][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:37:38,973][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:37:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:37:40,290][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:37:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:37:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:37:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:37:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:37:43,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:37:44,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:37:45,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:37:45,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:37:45,358][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:37:46,685][__main__][INFO] - Iteration 763 took 51s (9.87% Gen, 87.57% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 9m 27s. Estimated total time: 14h 25m 50s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 55s. [2026-03-26 01:37:46,688][__main__][INFO] - Starting iteration 763. [2026-03-26 01:37:46,693][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:37:46,693][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:37:52,079][__main__][INFO] - Number of regex retries in iteration 763: 0 [2026-03-26 01:37:52,080][__main__][INFO] - agents played in iteration 763 are Alice, Bob [2026-03-26 01:37:52,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:52,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:52,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:37:52,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:37:53,418][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:37:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:37:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:37:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:37:56,000][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:37:56,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:37:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:37:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:37:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:37:59,291][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:37:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:38:00,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:38:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:38:01,925][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:38:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:38:03,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:38:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:38:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:38:05,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:38:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:38:06,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:38:07,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:38:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:38:08,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:38:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:38:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:38:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:38:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:38:11,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:38:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:38:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:38:13,776][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:38:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:38:15,093][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:38:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:38:16,410][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:38:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:38:17,727][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:38:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:38:19,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:38:19,702][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:38:20,361][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:38:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:38:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:38:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:38:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:38:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:38:24,312][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:38:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:38:25,863][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:38:26,522][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:38:27,181][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:38:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:38:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:38:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:38:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:38:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:38:31,133][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:38:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:38:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:38:33,108][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:38:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:38:34,425][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:38:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:38:35,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:38:36,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:38:37,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:38:37,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:38:37,592][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:38:38,847][__main__][INFO] - Iteration 764 took 52s (10.33% Gen, 87.26% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 12m 2s. Estimated total time: 14h 29m 16s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 38s. [2026-03-26 01:38:38,850][__main__][INFO] - Starting iteration 764. [2026-03-26 01:38:38,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:38:38,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:38:44,200][__main__][INFO] - Number of regex retries in iteration 764: 0 [2026-03-26 01:38:44,201][__main__][INFO] - agents played in iteration 764 are Alice, Bob [2026-03-26 01:38:44,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:38:44,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:38:44,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:38:44,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:38:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:38:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:38:46,948][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:38:47,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:38:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:38:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:38:49,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:38:50,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:38:50,896][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:38:51,554][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:38:52,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:38:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:38:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:38:54,187][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:38:54,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:38:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:38:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:38:56,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:38:57,479][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:38:58,137][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:38:58,795][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:38:59,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:39:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:39:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:39:01,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:39:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:39:02,744][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:39:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:39:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:39:04,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:39:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:39:06,036][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:39:06,694][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:39:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:39:08,011][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:39:08,669][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:39:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:39:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:39:10,644][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:39:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:39:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:39:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:39:13,273][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:39:13,930][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:39:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:39:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:39:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:39:16,560][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:39:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:39:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:39:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:39:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:39:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:39:20,760][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:39:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:39:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:39:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:39:23,389][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:39:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:39:24,704][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:39:25,361][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:39:26,021][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:39:26,677][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:39:27,335][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:39:27,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:39:28,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:39:29,765][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:39:29,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:39:29,769][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:39:31,074][__main__][INFO] - Iteration 765 took 52s (10.24% Gen, 87.26% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 12m 15s. Estimated total time: 14h 30m 22s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 11s. [2026-03-26 01:39:31,077][__main__][INFO] - Starting iteration 765. [2026-03-26 01:39:31,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:39:31,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:39:35,925][__main__][INFO] - Number of regex retries in iteration 765: 0 [2026-03-26 01:39:35,927][__main__][INFO] - agents played in iteration 765 are Alice, Bob [2026-03-26 01:39:36,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:39:36,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:39:36,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:39:36,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:39:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:39:37,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:39:38,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:39:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:39:39,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:39:40,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:39:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:39:41,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:39:42,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:39:43,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:39:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:39:44,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:39:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:39:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:39:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:39:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:39:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:39:48,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:39:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:39:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:39:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:39:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:39:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:39:52,259][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:39:52,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:39:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:39:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:39:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:39:55,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:39:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:39:56,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:39:57,517][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:39:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:39:58,833][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:39:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:40:00,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:40:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:40:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:40:02,119][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:40:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:40:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:40:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:40:04,747][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:40:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:40:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:40:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:40:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:40:08,037][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:40:08,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:40:09,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:40:10,251][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:40:10,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:40:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:40:12,223][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:40:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:40:13,538][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:40:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:40:14,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:40:15,510][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:40:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:40:16,824][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:40:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:40:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:40:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:40:19,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:40:20,154][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:40:21,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:40:21,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:40:21,376][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:40:22,796][__main__][INFO] - Iteration 766 took 51s (9.37% Gen, 87.88% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 2m 57s. Estimated total time: 14h 21m 56s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 58s. [2026-03-26 01:40:22,798][__main__][INFO] - Starting iteration 766. [2026-03-26 01:40:22,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:40:22,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:40:28,015][__main__][INFO] - Number of regex retries in iteration 766: 0 [2026-03-26 01:40:28,016][__main__][INFO] - agents played in iteration 766 are Alice, Bob [2026-03-26 01:40:28,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:40:28,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:40:28,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:40:28,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:40:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:40:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:40:30,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:40:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:40:31,932][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:40:32,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:40:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:40:33,903][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:40:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:40:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:40:35,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:40:36,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:40:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:40:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:40:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:40:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:40:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:40:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:40:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:40:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:40:42,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:40:43,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:40:43,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:40:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:40:45,076][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:40:45,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:40:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:40:47,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:40:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:40:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:40:49,021][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:40:49,678][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:40:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:40:50,992][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:40:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:40:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:40:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:40:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:40:54,282][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:40:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:40:55,598][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:40:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:40:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:40:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:40:58,232][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:40:58,890][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:40:59,549][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:41:00,208][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:41:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:41:01,772][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:41:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:41:03,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:41:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:41:04,406][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:41:05,064][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:41:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:41:06,381][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:41:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:41:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:41:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:41:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:41:09,672][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:41:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:41:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:41:11,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:41:12,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:41:13,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:41:13,414][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:41:13,415][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:41:14,606][__main__][INFO] - Iteration 767 took 51s (10.06% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 3m 34s. Estimated total time: 14h 23m 25s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 20s, 500 more iterations: 7h 11m 42s. [2026-03-26 01:41:14,608][__main__][INFO] - Starting iteration 767. [2026-03-26 01:41:14,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:41:14,613][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:41:19,398][__main__][INFO] - Number of regex retries in iteration 767: 0 [2026-03-26 01:41:19,399][__main__][INFO] - agents played in iteration 767 are Alice, Bob [2026-03-26 01:41:20,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:41:20,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:41:20,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:41:20,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:41:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:41:21,337][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:41:21,996][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:41:22,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:41:23,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:41:23,968][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:41:24,626][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:41:25,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:41:25,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:41:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:41:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:41:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:41:28,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:41:29,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:41:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:41:30,547][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:41:31,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:41:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:41:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:41:33,181][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:41:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:41:34,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:41:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:41:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:41:36,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:41:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:41:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:41:38,447][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:41:39,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:41:39,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:41:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:41:41,081][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:41:41,740][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:41:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:41:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:41:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:41:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:41:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:41:45,688][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:41:46,347][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:41:47,005][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:41:47,664][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:41:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:41:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:41:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:41:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:41:50,955][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:41:51,613][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:41:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:41:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:41:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:41:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:41:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:41:55,804][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:41:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:41:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:41:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:41:58,442][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:41:59,101][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:41:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:42:00,420][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:42:01,079][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:42:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:42:02,399][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:42:03,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:42:03,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:42:05,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:42:05,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:42:05,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:42:06,275][__main__][INFO] - Iteration 768 took 51s (9.26% Gen, 88.41% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 0m 22s. Estimated total time: 14h 21m 4s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 32s. [2026-03-26 01:42:06,277][__main__][INFO] - Starting iteration 768. [2026-03-26 01:42:06,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:42:06,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:42:11,391][__main__][INFO] - Number of regex retries in iteration 768: 0 [2026-03-26 01:42:11,392][__main__][INFO] - agents played in iteration 768 are Alice, Bob [2026-03-26 01:42:11,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:12,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:12,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:42:12,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:42:12,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:42:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:42:13,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:42:14,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:42:15,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:42:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:42:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:42:17,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:42:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:42:18,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:42:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:42:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:42:20,556][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:42:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:42:21,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:42:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:42:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:42:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:42:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:42:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:42:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:42:26,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:42:27,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:42:27,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:42:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:42:29,120][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:42:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:42:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:42:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:42:31,754][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:42:32,413][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:42:33,072][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:42:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:42:34,391][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:42:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:42:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:42:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:42:37,026][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:42:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:42:38,344][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:42:39,003][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:42:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:42:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:42:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:42:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:42:42,299][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:42:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:42:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:42:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:42:45,191][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:42:45,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:42:46,506][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:42:47,165][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:42:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:42:48,481][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:42:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:42:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:42:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:42:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:42:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:42:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:42:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:42:53,748][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:42:54,406][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:42:55,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:42:55,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:42:57,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:42:57,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:42:57,016][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:42:58,208][__main__][INFO] - Iteration 769 took 51s (9.84% Gen, 87.86% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 3m 54s. Estimated total time: 14h 25m 28s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 44s. [2026-03-26 01:42:58,210][__main__][INFO] - Starting iteration 769. [2026-03-26 01:42:58,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:42:58,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:43:02,873][__main__][INFO] - Number of regex retries in iteration 769: 0 [2026-03-26 01:43:02,875][__main__][INFO] - agents played in iteration 769 are Alice, Bob [2026-03-26 01:43:03,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:03,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:03,526][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:43:03,527][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:43:04,172][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:43:04,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:43:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:43:06,106][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:43:06,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:43:07,423][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:43:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:43:08,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:43:09,398][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:43:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:43:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:43:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:43:12,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:43:12,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:43:13,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:43:14,010][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:43:14,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:43:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:43:15,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:43:16,644][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:43:17,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:43:17,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:43:18,622][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:43:19,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:43:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:43:20,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:43:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:43:21,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:43:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:43:23,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:43:23,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:43:24,551][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:43:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:43:25,870][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:43:26,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:43:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:43:27,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:43:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:43:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:43:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:43:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:43:31,140][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:43:31,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:43:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:43:33,117][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:43:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:43:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:43:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:43:35,995][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:43:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:43:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:43:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:43:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:43:39,291][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:43:39,949][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:43:40,608][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:43:41,268][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:43:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:43:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:43:43,244][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:43:43,904][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:43:44,563][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:43:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:43:45,880][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:43:46,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:43:47,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:43:48,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:43:48,473][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:43:48,474][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:43:49,755][__main__][INFO] - Iteration 770 took 51s (9.04% Gen, 88.47% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 56m 36s. Estimated total time: 14h 19m 1s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 54s, 500 more iterations: 7h 9m 30s. [2026-03-26 01:43:49,757][__main__][INFO] - Starting iteration 770. [2026-03-26 01:43:49,762][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:43:49,763][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:43:55,946][__main__][INFO] - Number of regex retries in iteration 770: 0 [2026-03-26 01:43:55,948][__main__][INFO] - agents played in iteration 770 are Alice, Bob [2026-03-26 01:43:56,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:56,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:56,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:43:56,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:43:57,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:43:57,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:43:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:43:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:43:59,848][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:44:00,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:44:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:44:01,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:44:02,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:44:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:44:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:44:04,459][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:44:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:44:05,777][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:44:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:44:07,093][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:44:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:44:08,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:44:09,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:44:09,728][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:44:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:44:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:44:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:44:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:44:13,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:44:13,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:44:14,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:44:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:44:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:44:16,315][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:44:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:44:17,633][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:44:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:44:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:44:19,609][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:44:20,268][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:44:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:44:21,586][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:44:22,245][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:44:22,904][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:44:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:44:24,222][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:44:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:44:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:44:26,199][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:44:26,858][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:44:27,516][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:44:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:44:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:44:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:44:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:44:31,066][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:44:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:44:32,384][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:44:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:44:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:44:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:44:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:44:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:44:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:44:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:44:37,657][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:44:38,316][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:44:38,975][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:44:39,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:44:40,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:44:41,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:44:41,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:44:41,435][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:44:42,629][__main__][INFO] - Iteration 771 took 52s (11.70% Gen, 86.04% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 17m 51s. Estimated total time: 14h 41m 9s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 6s, 500 more iterations: 7h 20m 34s. [2026-03-26 01:44:42,632][__main__][INFO] - Starting iteration 771. [2026-03-26 01:44:42,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:44:42,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:44:47,449][__main__][INFO] - Number of regex retries in iteration 771: 0 [2026-03-26 01:44:47,451][__main__][INFO] - agents played in iteration 771 are Alice, Bob [2026-03-26 01:44:47,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:44:48,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:44:48,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:44:48,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:44:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:44:49,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:44:49,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:44:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:44:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:44:51,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:44:52,599][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:44:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:44:53,915][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:44:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:44:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:44:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:44:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:44:57,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:44:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:44:58,524][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:44:59,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:44:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:45:00,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:45:01,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:45:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:45:02,473][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:45:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:45:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:45:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:45:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:45:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:45:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:45:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:45:07,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:45:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:45:09,058][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:45:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:45:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:45:11,033][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:45:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:45:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:45:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:45:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:45:14,326][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:45:14,984][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:45:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:45:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:45:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:45:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:45:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:45:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:45:19,594][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:45:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:45:21,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:45:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:45:22,463][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:45:23,122][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:45:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:45:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:45:25,097][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:45:25,756][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:45:26,414][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:45:27,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:45:27,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:45:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:45:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:45:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:45:30,364][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:45:31,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:45:31,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:45:32,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:45:32,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:45:32,931][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:45:34,208][__main__][INFO] - Iteration 772 took 51s (9.33% Gen, 88.18% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 55m 23s. Estimated total time: 14h 19m 34s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 57s, 500 more iterations: 7h 9m 47s. [2026-03-26 01:45:34,211][__main__][INFO] - Starting iteration 772. [2026-03-26 01:45:34,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:45:34,217][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:45:39,162][__main__][INFO] - Number of regex retries in iteration 772: 0 [2026-03-26 01:45:39,164][__main__][INFO] - agents played in iteration 772 are Alice, Bob [2026-03-26 01:45:39,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:45:39,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:45:39,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:45:39,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:45:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:45:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:45:41,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:45:42,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:45:43,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:45:43,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:45:44,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:45:45,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:45:45,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:45:46,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:45:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:45:47,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:45:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:45:48,962][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:45:49,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:45:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:45:50,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:45:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:45:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:45:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:45:53,568][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:45:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:45:54,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:45:55,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:45:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:45:56,861][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:45:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:45:58,179][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:45:58,837][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:45:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:46:00,153][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:46:00,812][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:46:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:46:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:46:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:46:03,445][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:46:04,103][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:46:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:46:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:46:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:46:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:46:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:46:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:46:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:46:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:46:10,032][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:46:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:46:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:46:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:46:12,979][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:46:13,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:46:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:46:14,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:46:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:46:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:46:16,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:46:17,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:46:18,251][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:46:18,912][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:46:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:46:20,231][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:46:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:46:21,549][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:46:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:46:22,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:46:23,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:46:24,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:46:24,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:46:24,938][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:46:26,337][__main__][INFO] - Iteration 773 took 52s (9.49% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 3m 41s. Estimated total time: 14h 28m 43s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 21s. [2026-03-26 01:46:26,339][__main__][INFO] - Starting iteration 773. [2026-03-26 01:46:26,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:46:26,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:46:31,448][__main__][INFO] - Number of regex retries in iteration 773: 0 [2026-03-26 01:46:31,449][__main__][INFO] - agents played in iteration 773 are Alice, Bob [2026-03-26 01:46:32,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:46:32,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:46:32,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:46:32,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:46:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:46:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:46:34,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:46:34,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:46:35,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:46:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:46:36,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:46:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:46:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:46:38,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:46:39,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:46:39,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:46:40,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:46:41,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:46:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:46:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:46:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:46:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:46:44,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:46:45,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:46:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:46:46,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:46:47,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:46:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:46:48,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:46:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:46:49,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:46:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:46:51,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:46:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:46:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:46:53,108][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:46:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:46:54,425][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:46:55,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:46:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:46:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:46:57,062][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:46:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:46:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:46:59,039][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:46:59,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:47:00,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:47:01,015][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:47:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:47:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:47:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:47:03,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:47:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:47:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:47:05,959][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:47:06,619][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:47:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:47:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:47:08,594][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:47:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:47:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:47:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:47:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:47:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:47:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:47:13,202][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:47:13,861][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:47:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:47:15,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:47:15,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:47:17,095][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:47:17,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:47:17,099][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:47:18,350][__main__][INFO] - Iteration 774 took 52s (9.81% Gen, 87.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 0m 54s. Estimated total time: 14h 26m 48s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 24s. [2026-03-26 01:47:18,353][__main__][INFO] - Starting iteration 774. [2026-03-26 01:47:18,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:47:18,358][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:47:23,453][__main__][INFO] - Number of regex retries in iteration 774: 0 [2026-03-26 01:47:23,454][__main__][INFO] - agents played in iteration 774 are Alice, Bob [2026-03-26 01:47:23,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:47:24,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:47:24,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:47:24,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:47:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:47:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:47:25,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:47:26,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:47:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:47:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:47:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:47:29,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:47:29,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:47:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:47:31,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:47:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:47:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:47:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:47:33,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:47:34,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:47:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:47:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:47:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:47:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:47:37,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:47:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:47:39,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:47:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:47:40,442][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:47:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:47:41,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:47:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:47:43,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:47:43,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:47:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:47:45,081][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:47:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:47:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:47:47,055][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:47:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:47:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:47:49,032][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:47:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:47:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:47:51,010][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:47:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:47:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:47:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:47:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:47:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:47:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:47:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:47:56,634][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:47:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:47:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:47:58,611][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:47:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:47:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:48:00,590][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:48:01,248][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:48:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:48:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:48:03,224][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:48:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:48:04,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:48:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:48:05,963][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:48:06,632][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:48:07,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:48:08,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:48:09,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:48:09,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:48:09,238][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:48:10,573][__main__][INFO] - Iteration 775 took 52s (9.76% Gen, 87.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 3m 32s. Estimated total time: 14h 30m 18s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 9s. [2026-03-26 01:48:10,583][__main__][INFO] - Starting iteration 775. [2026-03-26 01:48:10,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:48:10,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:48:15,550][__main__][INFO] - Number of regex retries in iteration 775: 0 [2026-03-26 01:48:15,552][__main__][INFO] - agents played in iteration 775 are Alice, Bob [2026-03-26 01:48:16,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:48:16,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:48:16,282][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:48:16,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:48:16,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:48:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:48:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:48:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:48:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:48:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:48:20,841][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:48:21,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:48:22,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:48:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:48:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:48:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:48:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:48:25,452][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:48:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:48:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:48:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:48:28,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:48:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:48:29,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:48:30,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:48:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:48:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:48:32,036][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:48:32,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:48:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:48:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:48:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:48:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:48:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:48:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:48:37,304][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:48:37,962][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:48:38,620][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:48:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:48:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:48:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:48:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:48:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:48:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:48:43,228][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:48:43,887][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:48:44,545][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:48:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:48:45,862][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:48:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:48:47,179][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:48:47,837][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:48:48,825][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:48:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:48:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:48:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:48:51,459][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:48:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:48:52,776][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:48:53,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:48:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:48:54,750][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:48:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:48:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:48:56,727][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:48:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:48:58,044][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:48:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:48:59,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:49:00,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:49:01,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:49:01,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:49:01,127][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:49:02,521][__main__][INFO] - Iteration 776 took 51s (9.51% Gen, 87.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 57m 29s. Estimated total time: 14h 25m 7s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 33s. [2026-03-26 01:49:02,523][__main__][INFO] - Starting iteration 776. [2026-03-26 01:49:02,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:49:02,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:49:07,762][__main__][INFO] - Number of regex retries in iteration 776: 0 [2026-03-26 01:49:07,763][__main__][INFO] - agents played in iteration 776 are Alice, Bob [2026-03-26 01:49:08,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:49:08,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:49:08,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:49:08,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:49:08,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:49:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:49:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:49:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:49:11,574][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:49:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:49:12,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:49:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:49:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:49:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:49:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:49:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:49:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:49:17,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:49:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:49:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:49:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:49:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:49:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:49:21,444][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:49:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:49:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:49:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:49:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:49:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:49:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:49:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:49:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:49:27,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:49:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:49:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:49:29,345][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:49:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:49:30,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:49:31,320][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:49:31,979][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:49:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:49:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:49:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:49:34,613][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:49:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:49:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:49:36,586][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:49:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:49:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:49:38,561][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:49:39,219][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:49:39,878][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:49:40,797][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:49:41,456][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:49:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:49:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:49:43,430][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:49:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:49:44,747][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:49:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:49:46,063][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:49:46,721][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:49:47,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:49:48,038][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:49:48,697][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:49:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:49:50,014][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:49:50,672][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:49:51,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:49:52,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:49:53,276][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:49:53,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:49:53,280][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:49:54,708][__main__][INFO] - Iteration 777 took 52s (10.03% Gen, 87.23% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 1m 12s. Estimated total time: 14h 29m 42s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 51s. [2026-03-26 01:49:54,710][__main__][INFO] - Starting iteration 777. [2026-03-26 01:49:54,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:49:54,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:49:59,521][__main__][INFO] - Number of regex retries in iteration 777: 0 [2026-03-26 01:49:59,523][__main__][INFO] - agents played in iteration 777 are Alice, Bob [2026-03-26 01:50:00,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:00,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:00,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:50:00,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:50:00,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:50:01,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:50:02,156][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:50:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:50:03,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:50:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:50:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:50:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:50:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:50:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:50:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:50:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:50:08,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:50:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:50:10,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:50:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:50:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:50:12,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:50:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:50:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:50:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:50:14,665][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:50:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:50:16,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:50:16,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:50:17,464][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:50:18,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:50:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:50:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:50:20,097][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:50:20,755][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:50:21,413][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:50:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:50:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:50:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:50:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:50:24,704][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:50:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:50:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:50:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:50:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:50:27,994][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:50:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:50:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:50:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:50:30,626][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:50:31,284][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:50:31,942][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:50:32,856][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:50:33,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:50:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:50:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:50:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:50:36,147][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:50:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:50:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:50:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:50:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:50:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:50:40,096][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:50:40,753][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:50:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:50:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:50:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:50:43,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:50:44,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:50:45,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:50:45,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:50:45,487][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:50:46,730][__main__][INFO] - Iteration 778 took 52s (9.24% Gen, 88.36% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 57m 34s. Estimated total time: 14h 26m 57s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 28s. [2026-03-26 01:50:46,734][__main__][INFO] - Starting iteration 778. [2026-03-26 01:50:46,747][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:50:46,748][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:50:51,621][__main__][INFO] - Number of regex retries in iteration 778: 0 [2026-03-26 01:50:51,622][__main__][INFO] - agents played in iteration 778 are Alice, Bob [2026-03-26 01:50:52,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:52,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:52,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:50:52,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:50:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:50:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:50:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:50:54,885][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:50:55,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:50:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:50:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:50:57,516][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:50:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:50:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:50:59,489][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:51:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:51:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:51:01,464][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:51:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:51:02,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:51:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:51:04,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:51:04,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:51:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:51:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:51:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:51:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:51:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:51:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:51:09,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:51:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:51:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:51:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:51:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:51:12,650][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:51:13,308][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:51:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:51:14,625][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:51:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:51:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:51:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:51:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:51:17,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:51:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:51:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:51:19,892][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:51:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:51:21,209][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:51:21,868][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:51:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:51:23,185][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:51:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:51:24,726][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:51:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:51:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:51:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:51:27,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:51:28,018][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:51:28,676][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:51:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:51:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:51:30,652][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:51:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:51:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:51:32,627][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:51:33,285][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:51:33,944][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:51:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:51:35,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:51:35,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:51:37,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:51:37,111][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:51:37,113][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:51:38,258][__main__][INFO] - Iteration 779 took 51s (9.46% Gen, 88.31% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 48m 17s. Estimated total time: 14h 18m 31s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 51s, 500 more iterations: 7h 9m 15s. [2026-03-26 01:51:38,260][__main__][INFO] - Starting iteration 779. [2026-03-26 01:51:38,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:51:38,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:51:43,350][__main__][INFO] - Number of regex retries in iteration 779: 0 [2026-03-26 01:51:43,351][__main__][INFO] - agents played in iteration 779 are Alice, Bob [2026-03-26 01:51:43,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:51:43,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:51:43,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:51:43,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:51:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:51:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:51:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:51:46,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:51:47,223][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:51:47,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:51:48,537][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:51:49,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:51:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:51:50,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:51:51,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:51:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:51:52,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:51:53,135][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:51:53,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:51:54,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:51:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:51:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:51:56,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:51:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:51:57,739][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:51:58,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:51:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:51:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:52:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:52:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:52:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:52:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:52:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:52:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:52:04,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:52:04,970][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:52:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:52:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:52:06,943][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:52:07,600][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:52:08,258][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:52:08,915][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:52:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:52:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:52:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:52:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:52:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:52:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:52:13,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:52:14,177][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:52:14,834][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:52:15,492][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:52:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:52:17,065][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:52:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:52:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:52:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:52:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:52:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:52:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:52:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:52:22,323][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:52:22,980][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:52:23,638][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:52:24,296][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:52:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:52:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:52:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:52:26,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:52:27,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:52:28,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:52:28,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:52:28,786][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:52:30,139][__main__][INFO] - Iteration 780 took 51s (9.80% Gen, 87.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 53m 30s. Estimated total time: 14h 24m 36s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 27s, 500 more iterations: 7h 12m 18s. [2026-03-26 01:52:30,141][__main__][INFO] - Starting iteration 780. [2026-03-26 01:52:30,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:52:30,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:52:34,989][__main__][INFO] - Number of regex retries in iteration 780: 0 [2026-03-26 01:52:34,990][__main__][INFO] - agents played in iteration 780 are Alice, Bob [2026-03-26 01:52:35,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:52:35,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:52:35,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:52:35,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:52:36,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:52:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:52:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:52:38,246][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:52:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:52:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:52:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:52:40,873][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:52:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:52:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:52:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:52:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:52:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:52:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:52:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:52:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:52:46,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:52:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:52:48,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:52:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:52:49,417][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:52:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:52:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:52:51,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:52:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:52:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:52:53,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:52:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:52:54,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:52:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:52:55,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:52:56,649][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:52:57,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:52:57,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:52:58,621][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:52:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:53:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:53:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:53:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:53:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:53:02,758][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:53:03,415][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:53:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:53:04,730][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:53:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:53:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:53:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:53:07,360][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:53:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:53:08,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:53:09,549][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:53:10,207][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:53:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:53:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:53:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:53:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:53:13,494][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:53:14,152][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:53:14,809][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:53:15,467][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:53:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:53:16,781][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:53:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:53:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:53:18,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:53:19,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:53:20,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:53:20,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:53:20,541][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:53:21,926][__main__][INFO] - Iteration 781 took 51s (9.36% Gen, 87.97% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 51m 4s. Estimated total time: 14h 23m 2s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 18s, 500 more iterations: 7h 11m 31s. [2026-03-26 01:53:21,928][__main__][INFO] - Starting iteration 781. [2026-03-26 01:53:21,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:53:21,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:53:27,324][__main__][INFO] - Number of regex retries in iteration 781: 0 [2026-03-26 01:53:27,326][__main__][INFO] - agents played in iteration 781 are Alice, Bob [2026-03-26 01:53:27,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:53:27,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:53:27,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:53:27,906][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:53:28,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:53:29,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:53:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:53:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:53:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:53:31,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:53:32,469][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:53:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:53:33,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:53:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:53:35,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:53:35,754][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:53:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:53:37,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:53:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:53:38,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:53:39,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:53:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:53:40,355][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:53:41,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:53:41,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:53:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:53:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:53:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:53:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:53:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:53:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:53:46,269][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:53:46,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:53:47,584][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:53:48,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:53:48,898][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:53:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:53:50,213][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:53:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:53:51,527][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:53:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:53:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:53:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:53:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:53:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:53:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:53:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:53:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:53:57,443][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:53:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:53:58,758][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:53:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:54:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:54:01,003][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:54:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:54:02,317][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:54:02,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:54:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:54:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:54:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:54:05,604][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:54:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:54:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:54:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:54:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:54:08,892][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:54:09,549][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:54:10,207][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:54:10,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:54:11,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:54:12,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:54:12,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:54:12,665][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:54:14,006][__main__][INFO] - Iteration 782 took 52s (10.35% Gen, 87.06% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 55m 6s. Estimated total time: 14h 27m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 58s. [2026-03-26 01:54:14,009][__main__][INFO] - Starting iteration 782. [2026-03-26 01:54:14,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:54:14,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:54:18,776][__main__][INFO] - Number of regex retries in iteration 782: 0 [2026-03-26 01:54:18,777][__main__][INFO] - agents played in iteration 782 are Alice, Bob [2026-03-26 01:54:19,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:54:19,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:54:19,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:54:19,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:54:20,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:54:20,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:54:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:54:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:54:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:54:23,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:54:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:54:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:54:25,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:54:25,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:54:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:54:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:54:27,923][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:54:28,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:54:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:54:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:54:30,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:54:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:54:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:54:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:54:33,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:54:33,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:54:34,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:54:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:54:35,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:54:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:54:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:54:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:54:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:54:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:54:39,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:54:40,407][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:54:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:54:41,721][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:54:42,379][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:54:43,036][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:54:43,693][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:54:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:54:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:54:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:54:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:54:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:54:47,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:54:48,295][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:54:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:54:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:54:50,266][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:54:50,923][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:54:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:54:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:54:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:54:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:54:54,437][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:54:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:54:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:54:56,410][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:54:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:54:57,725][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:54:58,382][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:54:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:54:59,697][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:55:00,354][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:55:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:55:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:55:02,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:55:03,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:55:04,119][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:55:04,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:55:04,123][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:55:05,508][__main__][INFO] - Iteration 783 took 51s (9.25% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 44m 35s. Estimated total time: 14h 18m 17s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 49s, 500 more iterations: 7h 9m 8s. [2026-03-26 01:55:05,511][__main__][INFO] - Starting iteration 783. [2026-03-26 01:55:05,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:55:05,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:55:10,792][__main__][INFO] - Number of regex retries in iteration 783: 0 [2026-03-26 01:55:10,793][__main__][INFO] - agents played in iteration 783 are Alice, Bob [2026-03-26 01:55:11,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:55:11,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:55:11,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:55:11,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:55:12,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:55:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:55:13,365][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:55:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:55:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:55:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:55:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:55:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:55:17,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:55:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:55:18,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:55:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:55:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:55:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:55:21,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:55:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:55:22,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:55:23,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:55:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:55:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:55:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:55:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:55:26,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:55:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:55:27,822][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:55:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:55:29,136][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:55:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:55:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:55:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:55:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:55:32,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:55:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:55:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:55:34,394][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:55:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:55:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:55:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:55:37,023][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:55:37,680][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:55:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:55:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:55:39,653][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:55:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:55:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:55:41,625][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:55:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:55:42,939][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:55:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:55:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:55:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:55:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:55:46,531][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:55:47,188][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:55:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:55:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:55:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:55:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:55:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:55:51,132][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:55:51,789][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:55:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:55:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:55:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:55:54,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:55:55,082][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:55:56,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:55:56,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:55:56,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:55:57,570][__main__][INFO] - Iteration 784 took 52s (10.14% Gen, 87.23% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 53m 3s. Estimated total time: 14h 27m 36s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 48s. [2026-03-26 01:55:57,572][__main__][INFO] - Starting iteration 784. [2026-03-26 01:55:57,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:55:57,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:56:03,094][__main__][INFO] - Number of regex retries in iteration 784: 0 [2026-03-26 01:56:03,095][__main__][INFO] - agents played in iteration 784 are Alice, Bob [2026-03-26 01:56:03,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:03,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:03,726][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:56:03,727][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:56:04,372][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:56:04,985][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:56:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:56:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:56:06,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:56:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:56:08,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:56:08,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:56:09,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:56:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:56:10,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:56:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:56:12,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:56:12,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:56:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:56:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:56:14,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:56:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:56:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:56:16,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:56:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:56:18,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:56:18,784][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:56:19,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:56:20,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:56:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:56:21,413][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:56:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:56:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:56:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:56:24,043][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:56:24,700][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:56:25,358][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:56:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:56:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:56:27,329][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:56:27,987][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:56:28,644][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:56:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:56:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:56:30,616][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:56:31,273][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:56:31,930][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:56:32,587][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:56:33,245][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:56:33,902][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:56:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:56:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:56:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:56:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:56:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:56:38,071][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:56:38,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:56:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:56:40,044][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:56:40,702][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:56:41,360][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:56:42,019][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:56:42,677][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:56:43,336][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:56:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:56:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:56:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:56:45,969][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:56:46,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:56:47,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:56:48,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:56:48,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:56:48,450][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:56:49,766][__main__][INFO] - Iteration 785 took 52s (10.57% Gen, 86.90% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 54m 27s. Estimated total time: 14h 29m 52s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 59s, 500 more iterations: 7h 14m 56s. [2026-03-26 01:56:49,770][__main__][INFO] - Starting iteration 785. [2026-03-26 01:56:49,774][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:56:49,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:56:55,069][__main__][INFO] - Number of regex retries in iteration 785: 0 [2026-03-26 01:56:55,070][__main__][INFO] - agents played in iteration 785 are Alice, Bob [2026-03-26 01:56:55,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:55,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:55,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:56:55,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:56:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:56:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:56:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:56:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:56:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:56:59,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:57:00,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:57:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:57:01,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:57:02,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:57:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:57:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:57:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:57:04,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:57:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:57:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:57:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:57:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:57:08,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:57:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:57:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:57:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:57:10,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:57:11,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:57:12,155][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:57:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:57:13,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:57:14,131][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:57:14,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:57:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:57:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:57:16,768][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:57:17,428][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:57:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:57:18,747][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:57:19,406][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:57:20,065][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:57:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:57:21,383][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:57:22,043][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:57:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:57:23,362][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:57:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:57:24,680][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:57:25,340][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:57:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:57:26,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:57:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:57:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:57:28,890][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:57:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:57:30,208][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:57:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:57:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:57:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:57:32,845][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:57:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:57:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:57:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:57:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:57:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:57:36,801][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:57:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:57:38,120][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:57:38,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:57:39,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:57:40,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:57:40,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:57:40,579][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:57:41,986][__main__][INFO] - Iteration 786 took 52s (10.14% Gen, 87.16% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 53m 55s. Estimated total time: 14h 30m 13s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 6s. [2026-03-26 01:57:41,988][__main__][INFO] - Starting iteration 786. [2026-03-26 01:57:41,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:57:41,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:57:43,238][mllm.models.large_language_model_local][WARNING] - Response /A did not match regex: (|), retry 1/1 [2026-03-26 01:57:47,316][__main__][INFO] - Number of regex retries in iteration 786: 1 [2026-03-26 01:57:47,317][__main__][INFO] - agents played in iteration 786 are Alice, Bob [2026-03-26 01:57:47,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:57:47,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:57:47,987][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:57:47,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-26 01:57:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:57:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:57:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:57:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:57:51,233][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:57:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:57:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:57:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:57:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:57:54,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:57:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-26 01:57:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:57:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:57:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:57:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:57:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:57:59,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:57:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:58:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:58:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:58:01,777][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-26 01:58:02,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:58:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:58:03,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:58:04,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:58:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:58:05,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:58:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:58:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:58:07,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:58:08,368][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:58:09,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-26 01:58:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:58:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:58:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:58:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:58:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:58:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:58:13,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:58:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:58:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 
01:58:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:58:16,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:58:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:58:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:58:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:58:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:58:19,578][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:58:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:58:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:58:21,780][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:58:22,438][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 01:58:23,098][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:58:23,757][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:58:24,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:58:25,077][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:58:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:58:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:58:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:58:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:58:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:58:29,037][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-26 01:58:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:58:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:58:31,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-26 01:58:31,716][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:58:32,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:58:32,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:58:32,829][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:58:34,154][__main__][INFO] - Iteration 787 took 52s (10.21% Gen, 87.25% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 52m 14s. Estimated total time: 14h 29m 24s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 42s. [2026-03-26 01:58:34,157][__main__][INFO] - Starting iteration 787. [2026-03-26 01:58:34,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:58:34,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:58:39,097][__main__][INFO] - Number of regex retries in iteration 787: 0 [2026-03-26 01:58:39,098][__main__][INFO] - agents played in iteration 787 are Alice, Bob [2026-03-26 01:58:39,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:39,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:39,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:58:39,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:58:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:58:41,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:58:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:58:42,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:58:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:58:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:58:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:58:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:58:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:58:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:58:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:58:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:58:48,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:58:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:58:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:58:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:58:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:58:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:58:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:58:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:58:53,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:58:54,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:58:54,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:58:55,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:58:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:58:56,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:58:57,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:58:58,127][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:58:58,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:58:59,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:59:00,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:59:00,763][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:59:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:59:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:59:02,738][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:59:03,397][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:59:04,055][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:59:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:59:05,372][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:59:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:59:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:59:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:59:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:59:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:59:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:59:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:59:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:59:11,299][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:59:12,218][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:59:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:59:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:59:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:59:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:59:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:59:16,170][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:59:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:59:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:59:18,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:59:18,805][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:59:19,464][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:59:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:59:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:59:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:59:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:59:22,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:59:23,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:59:24,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:59:24,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:59:24,636][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:59:25,806][__main__][INFO] - Iteration 788 took 51s (9.56% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 42m 44s. Estimated total time: 14h 20m 46s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 4s, 500 more iterations: 7h 10m 23s. [2026-03-26 01:59:25,808][__main__][INFO] - Starting iteration 788. [2026-03-26 01:59:25,813][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:59:25,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:59:33,251][__main__][INFO] - Number of regex retries in iteration 788: 0 [2026-03-26 01:59:33,253][__main__][INFO] - agents played in iteration 788 are Alice, Bob [2026-03-26 01:59:33,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:59:33,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:59:33,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:59:33,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:59:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:59:35,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:59:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:59:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:59:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:59:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:59:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:59:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:59:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:59:40,520][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:59:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:59:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:59:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:59:43,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:59:43,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:59:44,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:59:45,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:59:45,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:59:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:59:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:59:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:59:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:59:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:59:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:59:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:59:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:59:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:59:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:59:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:59:53,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:59:54,344][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:59:55,003][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:59:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:59:56,320][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:59:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:59:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:59:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:59:58,952][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:59:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:00:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:00:00,927][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:00:01,585][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:00:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:00:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:00:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:00:04,216][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:00:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:00:05,533][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:00:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:00:07,078][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:00:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:00:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:00:09,053][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:00:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:00:10,370][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:00:11,028][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:00:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:00:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:00:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:00:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:00:14,320][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:00:14,979][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:00:15,638][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:00:16,296][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:00:16,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:00:17,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:00:18,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:00:18,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:00:18,825][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:00:20,071][__main__][INFO] - Iteration 789 took 54s (13.71% Gen, 83.99% Train). Generation: 7s, Training: 45s. Estimated remaining time: 3h 25m 25s. Estimated total time: 15h 4m 21s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 26s, 500 more iterations: 7h 32m 10s. [2026-03-26 02:00:20,074][__main__][INFO] - Starting iteration 789. [2026-03-26 02:00:20,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:00:20,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:00:31,565][__main__][INFO] - Number of regex retries in iteration 789: 0 [2026-03-26 02:00:31,566][__main__][INFO] - agents played in iteration 789 are Alice, Bob [2026-03-26 02:00:32,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:00:32,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:00:32,204][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:00:32,204][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:00:32,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:00:33,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:00:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:00:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:00:35,429][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:00:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:00:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:00:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:00:38,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:00:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:00:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:00:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:00:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:00:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:00:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:00:42,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:00:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:00:43,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:00:44,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:00:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:00:45,938][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:00:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:00:47,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:00:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:00:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:00:49,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:00:49,880][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:00:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:00:51,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:00:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:00:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:00:53,165][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:00:53,823][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:00:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:00:55,137][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:00:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:00:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:00:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:00:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:00:58,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:00:59,081][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:00:59,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:01:00,395][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:01:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:01:01,710][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:01:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:01:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:01:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:01:04,586][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:01:05,243][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:01:05,901][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:01:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:01:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:01:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:01:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:01:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:01:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:01:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:01:11,161][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:01:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:01:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:01:13,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:01:13,791][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:01:14,448][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:01:15,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:01:15,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:01:17,025][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:01:17,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:01:17,029][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:01:18,214][__main__][INFO] - Iteration 790 took 58s (19.76% Gen, 78.20% Train). Generation: 11s, Training: 45s. Estimated remaining time: 4h 29m 4s. Estimated total time: 16h 8m 58s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 53s, 500 more iterations: 8h 4m 29s. [2026-03-26 02:01:18,217][__main__][INFO] - Starting iteration 790. [2026-03-26 02:01:18,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:01:18,221][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:01:23,035][__main__][INFO] - Number of regex retries in iteration 790: 0 [2026-03-26 02:01:23,037][__main__][INFO] - agents played in iteration 790 are Alice, Bob [2026-03-26 02:01:23,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:01:23,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:01:23,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:01:23,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:01:24,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:01:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:01:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:01:26,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:01:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:01:27,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:01:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:01:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:01:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:01:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:01:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:01:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:01:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:01:32,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:01:33,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:01:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:01:34,799][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:01:35,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:01:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:01:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:01:37,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:01:38,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:01:38,742][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:01:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:01:40,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:01:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:01:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:01:42,029][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:01:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:01:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:01:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:01:44,658][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:01:45,316][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:01:45,973][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:01:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:01:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:01:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:01:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:01:49,259][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:01:49,917][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:01:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:01:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:01:51,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:01:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:01:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:01:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:01:54,519][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:01:55,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:01:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:01:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:01:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:01:58,090][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:01:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:01:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:02:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:02:00,719][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:02:01,376][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:02:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:02:02,691][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:02:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:02:04,008][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:02:04,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:02:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:02:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:02:06,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:02:07,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:02:08,486][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:02:08,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:02:08,490][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:02:09,809][__main__][INFO] - Iteration 791 took 51s (9.33% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 39m 5s. Estimated total time: 14h 19m 50s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 59s, 500 more iterations: 7h 9m 55s. [2026-03-26 02:02:09,813][__main__][INFO] - Starting iteration 791. [2026-03-26 02:02:09,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:02:09,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:02:14,573][__main__][INFO] - Number of regex retries in iteration 791: 0 [2026-03-26 02:02:14,574][__main__][INFO] - agents played in iteration 791 are Alice, Bob [2026-03-26 02:02:15,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:02:15,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:02:15,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:02:15,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:02:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:02:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:02:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:02:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:02:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:02:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:02:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:02:20,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:02:20,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:02:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:02:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:02:22,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:02:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:02:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:02:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:02:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:02:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:02:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:02:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:02:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:02:28,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:02:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:02:30,199][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:02:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:02:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:02:32,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:02:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:02:33,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:02:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:02:34,806][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:02:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:02:36,124][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:02:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:02:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:02:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:02:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:02:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:02:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:02:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:02:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:02:42,050][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:02:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:02:43,367][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:02:44,025][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:02:44,684][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:02:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:02:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:02:46,659][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:02:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:02:48,240][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:02:48,899][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:02:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:02:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:02:50,875][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:02:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:02:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:02:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:02:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:02:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:02:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:02:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:02:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:02:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:02:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:02:58,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:02:58,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:02:59,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:02:59,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:02:59,953][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:03:01,391][__main__][INFO] - Iteration 792 took 51s (9.22% Gen, 87.99% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 37m 58s. Estimated total time: 14h 19m 36s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 57s, 500 more iterations: 7h 9m 48s. [2026-03-26 02:03:01,394][__main__][INFO] - Starting iteration 792. [2026-03-26 02:03:01,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:03:01,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:03:06,663][__main__][INFO] - Number of regex retries in iteration 792: 0 [2026-03-26 02:03:06,664][__main__][INFO] - agents played in iteration 792 are Alice, Bob [2026-03-26 02:03:07,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:03:07,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:03:07,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:03:07,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:03:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:03:08,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:03:09,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:03:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:03:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:03:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:03:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:03:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:03:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:03:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:03:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:03:15,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:03:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:03:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:03:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:03:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:03:18,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:03:19,147][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:03:19,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:03:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:03:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:03:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:03:22,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:03:23,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:03:23,748][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:03:24,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:03:25,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:03:25,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:03:26,378][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:03:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:03:27,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:03:28,351][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:03:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:03:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:03:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:03:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:03:31,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:03:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:03:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:03:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:03:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:03:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:03:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:03:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:03:36,899][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:03:37,557][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:03:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:03:38,872][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:03:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:03:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:03:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:03:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:03:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:03:43,054][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:03:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:03:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:03:45,026][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:03:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:03:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:03:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:03:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:03:48,315][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:03:48,973][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:03:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:03:50,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:03:51,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:03:52,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:03:52,172][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:03:52,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:03:53,505][__main__][INFO] - Iteration 793 took 52s (10.10% Gen, 87.34% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 45m 59s. Estimated total time: 14h 28m 28s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 50s, 500 more iterations: 7h 14m 14s. [2026-03-26 02:03:53,507][__main__][INFO] - Starting iteration 793. [2026-03-26 02:03:53,512][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:03:53,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:04:03,083][__main__][INFO] - Number of regex retries in iteration 793: 0 [2026-03-26 02:04:03,084][__main__][INFO] - agents played in iteration 793 are Alice, Bob [2026-03-26 02:04:03,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:03,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:03,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:04:03,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:04:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:04:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:04:05,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:04:06,341][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:04:06,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:04:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:04:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:04:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:04:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:04:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:04:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:04:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:04:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:04:12,920][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:04:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:04:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:04:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:04:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:04:16,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:04:16,868][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:04:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:04:18,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:04:18,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:04:19,500][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:04:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:04:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:04:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:04:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:04:22,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:04:23,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:04:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:04:24,765][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:04:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:04:26,082][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:04:26,740][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:04:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:04:28,057][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:04:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:04:29,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:04:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:04:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:04:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:04:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:04:32,663][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:04:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:04:33,980][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:04:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:04:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:04:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:04:36,865][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:04:37,524][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:04:38,182][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:04:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:04:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:04:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:04:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:04:41,473][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:04:42,131][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:04:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:04:43,447][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:04:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:04:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:04:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:04:46,081][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:04:46,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:04:47,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:04:48,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:04:48,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:04:48,654][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:04:49,843][__main__][INFO] - Iteration 794 took 56s (16.99% Gen, 80.89% Train). Generation: 9s, Training: 45s. Estimated remaining time: 3h 55m 27s. Estimated total time: 15h 38m 53s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 26s. [2026-03-26 02:04:49,846][__main__][INFO] - Starting iteration 794. [2026-03-26 02:04:49,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:04:49,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:04:55,339][__main__][INFO] - Number of regex retries in iteration 794: 0 [2026-03-26 02:04:55,339][__main__][INFO] - agents played in iteration 794 are Alice, Bob [2026-03-26 02:04:55,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:55,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:55,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:04:55,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:04:56,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:04:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:04:57,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:04:58,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:04:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:04:59,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:05:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:05:01,167][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:05:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:05:02,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:05:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:05:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:05:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:05:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:05:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:05:06,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:05:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:05:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:05:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:05:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:05:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:05:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:05:11,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:05:11,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:05:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:05:13,007][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:05:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:05:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:05:14,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:05:15,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:05:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:05:16,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:05:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:05:18,272][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:05:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:05:19,589][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:05:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:05:20,906][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:05:21,564][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:05:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:05:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:05:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:05:24,195][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:05:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:05:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:05:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:05:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:05:27,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:05:28,380][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:05:29,039][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:05:29,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:05:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:05:31,013][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:05:31,671][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:05:32,329][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:05:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:05:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:05:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:05:34,963][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:05:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:05:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:05:36,938][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:05:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:05:38,256][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:05:38,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:05:39,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:05:40,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:05:40,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:05:40,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:05:42,033][__main__][INFO] - Iteration 795 took 52s (10.52% Gen, 87.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 45m 27s. Estimated total time: 14h 29m 45s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 52s. [2026-03-26 02:05:42,035][__main__][INFO] - Starting iteration 795. [2026-03-26 02:05:42,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:05:42,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:05:47,648][__main__][INFO] - Number of regex retries in iteration 795: 0 [2026-03-26 02:05:47,649][__main__][INFO] - agents played in iteration 795 are Alice, Bob [2026-03-26 02:05:48,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:05:48,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:05:48,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:05:48,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:05:48,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:05:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:05:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:05:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:05:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:05:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:05:52,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:05:53,506][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:05:54,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:05:54,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:05:55,476][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:05:56,134][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:05:56,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:05:57,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:05:58,106][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:05:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:05:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:06:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:06:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:06:01,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:06:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:06:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:06:03,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:06:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:06:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:06:05,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:06:05,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:06:06,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:06:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:06:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:06:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:06:09,275][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:06:09,932][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:06:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:06:11,246][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:06:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:06:12,560][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:06:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:06:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:06:14,532][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:06:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:06:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:06:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:06:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:06:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:06:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:06:19,132][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:06:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:06:20,700][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:06:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:06:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:06:22,671][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:06:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:06:23,985][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:06:24,642][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:06:25,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:06:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:06:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:06:27,273][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:06:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:06:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:06:29,245][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:06:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:06:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:06:31,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:06:31,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:06:33,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:06:33,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:06:33,171][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:06:34,439][__main__][INFO] - Iteration 796 took 52s (10.70% Gen, 86.87% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 48m 11s. Estimated total time: 14h 33m 21s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 40s. [2026-03-26 02:06:34,442][__main__][INFO] - Starting iteration 796. [2026-03-26 02:06:34,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:06:34,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:06:39,818][__main__][INFO] - Number of regex retries in iteration 796: 0 [2026-03-26 02:06:39,819][__main__][INFO] - agents played in iteration 796 are Alice, Bob [2026-03-26 02:06:40,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:06:40,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:06:40,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:06:40,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:06:41,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:06:41,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:06:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:06:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:06:43,639][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:06:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:06:44,952][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:06:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:06:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:06:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:06:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:06:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:06:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:06:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:06:50,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:06:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:06:51,527][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:06:52,186][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:06:52,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:06:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:06:54,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:06:54,818][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:06:55,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:06:56,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:06:56,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:06:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:06:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:06:58,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:06:59,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:07:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:07:00,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:07:01,403][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:07:02,062][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:07:02,720][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:07:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:07:04,037][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:07:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:07:05,354][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:07:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:07:06,671][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:07:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:07:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:07:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:07:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:07:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:07:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:07:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:07:11,939][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:07:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:07:13,508][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:07:14,166][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:07:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:07:15,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:07:16,141][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:07:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:07:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:07:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:07:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:07:19,433][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:07:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:07:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:07:21,409][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:07:22,068][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:07:22,726][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:07:23,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:07:24,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:07:25,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:07:25,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:07:25,315][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:07:26,806][__main__][INFO] - Iteration 797 took 52s (10.26% Gen, 86.89% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 46m 38s. Estimated total time: 14h 32m 41s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 20s. [2026-03-26 02:07:26,808][__main__][INFO] - Starting iteration 797. [2026-03-26 02:07:26,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:07:26,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:07:31,695][__main__][INFO] - Number of regex retries in iteration 797: 0 [2026-03-26 02:07:31,697][__main__][INFO] - agents played in iteration 797 are Alice, Bob [2026-03-26 02:07:32,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:07:32,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:07:32,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:07:32,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:07:33,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:07:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:07:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:07:34,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:07:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:07:36,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:07:36,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:07:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:07:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:07:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:07:39,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:07:40,201][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:07:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:07:41,516][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:07:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:07:42,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:07:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:07:44,145][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:07:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:07:45,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:07:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:07:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:07:47,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:07:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:07:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:07:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:07:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:07:50,717][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:07:51,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:07:52,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:07:52,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:07:53,346][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:07:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:07:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:07:55,320][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:07:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:07:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:07:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:07:57,951][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:07:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:07:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:07:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:08:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:08:01,237][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:08:01,894][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:08:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:08:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:08:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:08:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:08:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:08:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:08:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:08:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:08:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:08:08,719][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:08:09,377][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:08:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:08:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:08:11,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:08:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:08:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:08:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:08:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:08:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:08:15,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:08:15,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:08:17,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:08:17,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:08:17,087][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:08:18,208][__main__][INFO] - Iteration 798 took 51s (9.50% Gen, 88.31% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 29m 43s. Estimated total time: 14h 16m 37s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 39s, 500 more iterations: 7h 8m 18s. [2026-03-26 02:08:22,368][__main__][INFO] - Starting iteration 798. [2026-03-26 02:08:22,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:08:22,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:08:27,750][__main__][INFO] - Number of regex retries in iteration 798: 0 [2026-03-26 02:08:27,754][__main__][INFO] - agents played in iteration 798 are Alice, Bob [2026-03-26 02:08:28,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:08:28,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:08:28,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:08:28,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:08:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:08:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:08:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:08:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:08:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:08:32,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:08:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:08:33,529][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:08:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:08:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:08:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:08:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:08:36,813][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:08:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:08:38,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:08:38,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:08:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:08:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:08:40,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:08:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:08:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:08:42,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:08:43,385][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:08:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:08:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:08:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:08:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:08:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:08:47,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:08:47,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:08:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:08:49,300][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:08:49,957][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:08:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:08:51,271][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:08:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:08:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:08:53,244][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:08:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:08:54,559][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:08:55,216][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:08:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:08:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:08:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:08:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:08:58,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:08:59,161][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:08:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:09:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:09:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:09:02,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:09:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:09:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:09:04,038][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:09:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:09:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:09:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:09:06,667][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:09:07,325][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:09:07,982][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:09:08,640][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:09:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:09:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:09:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:09:11,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:09:12,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:09:13,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:09:13,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:09:13,177][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:09:14,622][__main__][INFO] - Iteration 799 took 52s (10.28% Gen, 86.93% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 43m 1s. Estimated total time: 14h 30m 51s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 25s. [2026-03-26 02:09:14,625][__main__][INFO] - Starting iteration 799. [2026-03-26 02:09:14,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:09:14,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:09:19,603][__main__][INFO] - Number of regex retries in iteration 799: 0 [2026-03-26 02:09:19,605][__main__][INFO] - agents played in iteration 799 are Alice, Bob [2026-03-26 02:09:20,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:09:20,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:09:20,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:09:20,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:09:20,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:09:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:09:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:09:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:09:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:09:24,168][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:09:24,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:09:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:09:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:09:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:09:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:09:28,111][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:09:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:09:29,425][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:09:30,082][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:09:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:09:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:09:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:09:32,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:09:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:09:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:09:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:09:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:09:35,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:09:36,652][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:09:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:09:37,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:09:38,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:09:39,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:09:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:09:40,597][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:09:41,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:09:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:09:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:09:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:09:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:09:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:09:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:09:45,855][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:09:46,512][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:09:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:09:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:09:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:09:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:09:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:09:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:09:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:09:51,770][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:09:52,677][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:09:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:09:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:09:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:09:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:09:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:09:56,623][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:09:57,281][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:09:57,938][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:09:58,596][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:09:59,253][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:09:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:10:00,568][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:10:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:10:01,883][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:10:02,540][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:10:03,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:10:03,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:10:05,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:10:05,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:10:05,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:10:06,268][__main__][INFO] - Iteration 800 took 51s (9.63% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 31m 58s. Estimated total time: 14h 20m 41s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 4s, 500 more iterations: 7h 10m 20s. [2026-03-26 02:10:06,271][__main__][INFO] - Starting iteration 800. [2026-03-26 02:10:06,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:10:06,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:10:11,298][__main__][INFO] - Number of regex retries in iteration 800: 0 [2026-03-26 02:10:11,299][__main__][INFO] - agents played in iteration 800 are Alice, Bob [2026-03-26 02:10:11,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:10:11,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:10:11,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:10:11,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:10:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:10:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:10:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:10:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:10:15,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:10:15,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:10:16,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:10:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:10:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:10:18,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:10:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:10:19,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:10:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:10:21,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:10:21,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:10:22,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:10:22,997][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:10:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:10:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:10:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:10:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:10:26,284][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:10:26,942][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:10:27,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:10:28,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:10:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:10:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:10:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:10:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:10:31,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:10:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:10:32,856][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:10:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:10:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:10:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:10:35,484][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:10:36,142][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:10:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:10:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:10:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:10:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:10:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:10:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:10:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:10:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:10:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:10:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:10:43,371][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:10:44,307][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:10:44,965][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:10:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:10:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:10:46,936][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:10:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:10:48,251][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:10:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:10:49,566][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:10:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:10:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:10:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:10:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:10:52,857][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:10:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:10:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:10:54,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:10:55,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:10:56,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:10:56,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:10:56,650][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:10:59,573][__main__][INFO] - Iteration 801 took 53s (9.42% Gen, 85.09% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 58m 44s. Estimated total time: 14h 48m 19s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 49s, 500 more iterations: 7h 24m 9s. [2026-03-26 02:10:59,575][__main__][INFO] - Starting iteration 801. [2026-03-26 02:10:59,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:10:59,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:11:01,600][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2026-03-26 02:11:05,296][__main__][INFO] - Number of regex retries in iteration 801: 1 [2026-03-26 02:11:05,297][__main__][INFO] - agents played in iteration 801 are Alice, Bob [2026-03-26 02:11:05,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:11:05,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:11:05,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:11:05,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-26 02:11:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:11:07,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:11:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:11:08,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:11:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:11:09,884][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:11:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:11:11,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:11:11,855][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:11:12,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:11:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-26 02:11:13,828][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:11:14,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:11:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:11:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:11:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:11:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:11:17,777][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:11:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:11:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:11:19,752][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-26 02:11:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:11:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:11:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:11:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:11:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:11:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:11:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:11:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:11:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:11:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:11:27,000][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-26 02:11:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:11:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:11:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:11:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:11:30,294][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:11:30,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:11:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:11:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:11:32,930][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 
02:11:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:11:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:11:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:11:35,567][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:11:36,226][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:11:36,886][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:11:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:11:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:11:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:11:39,755][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:11:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 02:11:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:11:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:11:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:11:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:11:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:11:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:11:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:11:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:11:46,347][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:11:47,006][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-26 02:11:47,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:11:48,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:11:48,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-26 02:11:49,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:11:50,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:11:50,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:11:50,868][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:11:52,200][__main__][INFO] - Iteration 802 took 52s (10.86% Gen, 86.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 46m 34s. Estimated total time: 14h 37m 2s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 31s. [2026-03-26 02:11:52,202][__main__][INFO] - Starting iteration 802. [2026-03-26 02:11:52,207][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:11:52,208][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:11:57,443][__main__][INFO] - Number of regex retries in iteration 802: 0 [2026-03-26 02:11:57,444][__main__][INFO] - agents played in iteration 802 are Alice, Bob [2026-03-26 02:11:57,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:11:58,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:11:58,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:11:58,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:11:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:11:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:12:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:12:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:12:01,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:12:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:12:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:12:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:12:03,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:12:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:12:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:12:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:12:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:12:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:12:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:12:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:12:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:12:09,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:12:10,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:12:11,210][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:12:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:12:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:12:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:12:13,841][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:12:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:12:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:12:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:12:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:12:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:12:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:12:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:12:19,106][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:12:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:12:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:12:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:12:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:12:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:12:23,054][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:12:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:12:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:12:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:12:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:12:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:12:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:12:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:12:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:12:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:12:29,635][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:12:30,563][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:12:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:12:31,880][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:12:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:12:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:12:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:12:34,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:12:35,170][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:12:35,828][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:12:36,486][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:12:37,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:12:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:12:38,460][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:12:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:12:39,776][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:12:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:12:41,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:12:41,841][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:12:42,946][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:12:42,948][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:12:42,951][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:12:44,240][__main__][INFO] - Iteration 803 took 52s (10.06% Gen, 87.45% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 35m 55s. Estimated total time: 14h 27m 16s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 38s. [2026-03-26 02:12:44,243][__main__][INFO] - Starting iteration 803. [2026-03-26 02:12:44,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:12:44,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:12:49,060][__main__][INFO] - Number of regex retries in iteration 803: 0 [2026-03-26 02:12:49,061][__main__][INFO] - agents played in iteration 803 are Alice, Bob [2026-03-26 02:12:49,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:12:49,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:12:49,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:12:49,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:12:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:12:50,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:12:51,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:12:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:12:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:12:53,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:12:54,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:12:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:12:55,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:12:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:12:56,913][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:12:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:12:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:12:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:12:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:13:00,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:13:00,855][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:13:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:13:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:13:02,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:13:03,484][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:13:04,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:13:04,798][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:13:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:13:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:13:06,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:13:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:13:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:13:08,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:13:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:13:10,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:13:10,714][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:13:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:13:12,029][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:13:12,687][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:13:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:13:14,001][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:13:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:13:15,316][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:13:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:13:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:13:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:13:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:13:18,603][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:13:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:13:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:13:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:13:21,234][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:13:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:13:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:13:23,442][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:13:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:13:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:13:25,414][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:13:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:13:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:13:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:13:28,046][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:13:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:13:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:13:30,020][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:13:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:13:31,336][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:13:31,993][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:13:32,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:13:33,365][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:13:34,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:13:34,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:13:34,686][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:13:36,079][__main__][INFO] - Iteration 804 took 51s (9.28% Gen, 88.02% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 31m 41s. Estimated total time: 14h 23m 53s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 56s. [2026-03-26 02:13:36,081][__main__][INFO] - Starting iteration 804. [2026-03-26 02:13:36,085][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:13:36,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:13:41,149][__main__][INFO] - Number of regex retries in iteration 804: 0 [2026-03-26 02:13:41,151][__main__][INFO] - agents played in iteration 804 are Alice, Bob [2026-03-26 02:13:41,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:13:41,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:13:41,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:13:41,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:13:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:13:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:13:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:13:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:13:45,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:13:45,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:13:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:13:47,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:13:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:13:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:13:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:13:49,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:13:50,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:13:50,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:13:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:13:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:13:52,927][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:13:53,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:13:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:13:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:13:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:13:56,214][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:13:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:13:57,529][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:13:58,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:13:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:13:59,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:14:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:14:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:14:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:14:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:14:02,794][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:14:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:14:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:14:04,768][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:14:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:14:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:14:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:14:07,401][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:14:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:14:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:14:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:14:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:14:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:14:11,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:14:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:14:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:14:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:14:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:14:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:14:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:14:16,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:14:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:14:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:14:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:14:18,840][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:14:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:14:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:14:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:14:21,473][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:14:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:14:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:14:23,447][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:14:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:14:24,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:14:25,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:14:26,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:14:26,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:14:26,706][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:14:27,952][__main__][INFO] - Iteration 805 took 51s (9.77% Gen, 87.83% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 31m 25s. Estimated total time: 14h 24m 28s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 26s, 500 more iterations: 7h 12m 14s. [2026-03-26 02:14:27,955][__main__][INFO] - Starting iteration 805. [2026-03-26 02:14:27,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:14:27,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:14:34,039][__main__][INFO] - Number of regex retries in iteration 805: 0 [2026-03-26 02:14:34,041][__main__][INFO] - agents played in iteration 805 are Alice, Bob [2026-03-26 02:14:34,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:14:34,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:14:34,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:14:34,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:14:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:14:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:14:36,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:14:37,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:14:37,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:14:38,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:14:39,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:14:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:14:40,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:14:41,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:14:41,822][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:14:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:14:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:14:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:14:44,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:14:45,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:14:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:14:46,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:14:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:14:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:14:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:14:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:14:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:14:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:14:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:14:51,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:14:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:14:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:14:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:14:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:14:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:14:55,651][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:14:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:14:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:14:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:14:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:14:58,949][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:14:59,609][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:15:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:15:00,928][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:15:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:15:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:15:02,905][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:15:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:15:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:15:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:15:05,543][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:15:06,203][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:15:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:15:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:15:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:15:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:15:09,788][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:15:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:15:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:15:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:15:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:15:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:15:13,744][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:15:14,404][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:15:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:15:15,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:15:16,383][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:15:17,042][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:15:17,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:15:18,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:15:19,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:15:19,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:15:19,653][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:15:20,964][__main__][INFO] - Iteration 806 took 53s (11.47% Gen, 86.05% Train). Generation: 6s, Training: 45s. Estimated remaining time: 2h 49m 29s. Estimated total time: 14h 43m 26s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 43s. [2026-03-26 02:15:20,966][__main__][INFO] - Starting iteration 806. [2026-03-26 02:15:20,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:15:20,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:15:26,093][__main__][INFO] - Number of regex retries in iteration 806: 0 [2026-03-26 02:15:26,095][__main__][INFO] - agents played in iteration 806 are Alice, Bob [2026-03-26 02:15:26,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:15:26,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:15:26,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:15:26,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:15:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:15:28,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:15:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:15:29,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:15:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:15:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:15:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:15:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:15:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:15:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:15:33,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:15:34,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:15:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:15:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:15:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:15:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:15:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:15:38,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:15:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:15:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:15:40,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:15:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:15:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:15:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:15:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:15:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:15:44,480][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:15:45,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:15:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:15:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:15:47,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:15:47,771][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:15:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:15:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:15:49,744][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:15:50,402][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:15:51,060][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:15:51,717][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:15:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:15:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:15:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:15:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:15:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:15:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:15:56,317][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:15:56,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:15:57,633][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:15:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:15:59,198][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:15:59,855][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:16:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:16:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:16:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:16:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:16:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:16:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:16:04,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:16:05,110][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:16:05,767][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:16:06,424][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:16:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:16:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:16:08,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:16:09,053][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:16:09,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:16:10,436][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:16:11,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:16:11,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:16:11,639][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:16:12,827][__main__][INFO] - Iteration 807 took 51s (9.88% Gen, 87.82% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 29m 29s. Estimated total time: 14h 24m 18s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 25s, 500 more iterations: 7h 12m 9s. [2026-03-26 02:16:12,829][__main__][INFO] - Starting iteration 807. [2026-03-26 02:16:12,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:16:12,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:16:17,614][__main__][INFO] - Number of regex retries in iteration 807: 0 [2026-03-26 02:16:17,615][__main__][INFO] - agents played in iteration 807 are Alice, Bob [2026-03-26 02:16:18,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:16:18,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:16:18,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:16:18,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:16:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:16:19,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:16:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:16:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:16:21,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:16:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:16:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:16:23,384][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:16:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:16:24,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:16:25,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:16:26,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:16:26,671][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:16:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:16:27,985][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:16:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:16:29,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:16:29,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:16:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:16:31,270][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:16:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:16:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:16:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:16:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:16:34,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:16:35,212][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:16:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:16:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:16:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:16:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:16:38,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:16:39,155][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:16:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:16:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:16:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:16:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:16:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:16:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:16:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:16:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:16:45,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:16:45,727][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:16:46,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:16:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:16:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:16:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:16:49,013][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:16:49,670][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:16:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:16:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:16:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:16:52,535][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:16:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:16:53,852][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:16:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:16:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:16:55,824][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:16:56,481][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:16:57,139][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:16:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:16:58,453][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:16:59,111][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:16:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:17:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:17:01,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:17:01,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:17:03,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:17:03,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:17:03,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:17:05,750][__main__][INFO] - Iteration 808 took 52s (9.04% Gen, 85.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 46m 17s. Estimated total time: 14h 41m 59s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 11s, 500 more iterations: 7h 20m 59s. [2026-03-26 02:17:05,753][__main__][INFO] - Starting iteration 808. [2026-03-26 02:17:05,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:17:05,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:17:11,168][__main__][INFO] - Number of regex retries in iteration 808: 0 [2026-03-26 02:17:11,170][__main__][INFO] - agents played in iteration 808 are Alice, Bob [2026-03-26 02:17:11,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:17:11,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:17:11,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:17:11,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:17:12,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:17:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:17:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:17:14,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:17:15,107][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:17:15,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:17:16,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:17:17,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:17:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:17:18,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:17:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:17:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:17:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:17:21,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:17:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:17:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:17:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:17:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:17:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:17:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:17:25,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:17:26,279][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:17:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:17:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:17:28,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:17:28,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:17:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:17:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:17:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:17:31,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:17:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:17:32,853][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:17:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:17:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:17:34,825][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:17:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:17:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:17:36,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:17:37,454][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:17:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:17:38,769][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:17:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:17:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:17:40,740][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:17:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:17:42,054][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:17:42,711][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:17:43,368][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:17:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:17:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:17:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:17:46,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:17:46,907][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:17:47,565][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:17:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:17:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:17:49,536][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:17:50,193][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:17:50,851][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:17:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:17:52,166][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:17:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:17:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:17:54,137][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:17:54,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:17:55,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 02:17:56,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:17:56,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:17:56,637][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:17:58,002][__main__][INFO] - Iteration 809 took 52s (10.36% Gen, 87.02% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 34m 12s. Estimated total time: 14h 30m 46s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 23s. [2026-03-26 02:17:58,005][__main__][INFO] - Starting iteration 809. [2026-03-26 02:17:58,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:17:58,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:18:02,738][__main__][INFO] - Number of regex retries in iteration 809: 0 [2026-03-26 02:18:02,739][__main__][INFO] - agents played in iteration 809 are Alice, Bob [2026-03-26 02:18:03,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:18:03,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:18:03,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:18:03,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:18:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:18:04,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:18:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:18:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:18:06,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:18:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:18:07,897][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:18:08,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:18:09,211][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:18:09,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:18:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:18:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:18:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:18:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:18:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:18:13,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:18:14,467][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:18:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:18:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:18:16,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:18:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:18:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:18:18,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:18:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:18:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:18:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:18:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:18:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:18:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:18:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:18:23,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:18:24,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:18:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:18:25,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:18:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:18:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:18:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:18:28,273][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:18:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:18:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:18:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:18:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:18:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:18:32,216][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:18:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:18:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:18:34,187][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:18:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:18:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:18:36,452][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:18:37,110][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:18:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:18:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:18:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:18:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:18:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:18:41,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:18:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:18:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:18:43,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:18:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:18:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:18:44,997][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:18:45,654][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:18:46,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:18:47,028][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:18:48,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:18:48,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:18:48,601][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:18:49,967][__main__][INFO] - Iteration 810 took 51s (9.10% Gen, 88.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 28m 34s. Estimated total time: 14h 26m 0s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 0s. [2026-03-26 02:18:49,971][__main__][INFO] - Starting iteration 810. [2026-03-26 02:18:49,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:18:49,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:18:54,891][__main__][INFO] - Number of regex retries in iteration 810: 0 [2026-03-26 02:18:54,892][__main__][INFO] - agents played in iteration 810 are Alice, Bob [2026-03-26 02:18:55,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:18:55,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:18:55,508][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:18:55,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:18:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:18:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:18:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:18:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:18:58,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:18:59,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:19:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:19:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:19:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:19:02,007][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:19:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:19:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:19:03,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:19:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:19:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:19:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:19:06,605][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:19:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:19:07,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:19:08,577][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:19:09,233][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:19:09,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:19:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:19:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:19:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:19:12,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:19:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:19:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:19:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:19:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:19:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:19:16,462][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:19:17,119][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:19:17,776][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:19:18,433][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:19:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:19:19,747][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:19:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:19:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:19:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:19:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:19:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:19:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:19:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:19:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:19:25,661][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:19:26,318][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:19:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:19:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:19:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:19:29,175][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:19:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:19:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:19:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:19:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:19:32,462][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:19:33,119][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:19:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:19:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:19:35,091][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:19:35,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:19:36,406][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:19:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:19:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:19:38,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:19:39,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:19:40,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:19:40,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:19:40,927][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:19:42,216][__main__][INFO] - Iteration 811 took 52s (9.41% Gen, 88.12% Train). Generation: 4s, Training: 46s. Estimated remaining time: 2h 32m 25s. Estimated total time: 14h 30m 43s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 21s. [2026-03-26 02:19:42,219][__main__][INFO] - Starting iteration 811. [2026-03-26 02:19:42,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:19:42,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:19:46,954][__main__][INFO] - Number of regex retries in iteration 811: 0 [2026-03-26 02:19:46,955][__main__][INFO] - agents played in iteration 811 are Alice, Bob [2026-03-26 02:19:47,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:19:47,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:19:47,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:19:47,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:19:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:19:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:19:49,550][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:19:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:19:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:19:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:19:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:19:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:19:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:19:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:19:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:19:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:19:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:19:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:19:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:19:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:19:58,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:19:59,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:20:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:20:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:20:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:20:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:20:02,687][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:20:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:20:04,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:20:04,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:20:05,315][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:20:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:20:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:20:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:20:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:20:08,601][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:20:09,258][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:20:09,915][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:20:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:20:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:20:11,887][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:20:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:20:13,203][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:20:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:20:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:20:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:20:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:20:16,489][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:20:17,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:20:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:20:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:20:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:20:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:20:20,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:20:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:20:22,068][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:20:22,725][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:20:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:20:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:20:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:20:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:20:26,011][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:20:26,668][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:20:27,325][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:20:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:20:28,639][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:20:29,296][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:20:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:20:30,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:20:31,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:20:32,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:20:32,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:20:32,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed42/seed_42/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:20:33,791][__main__][INFO] - Iteration 812 took 51s (9.17% Gen, 88.13% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 20m 20s. Estimated total time: 14h 19m 29s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 56s, 500 more iterations: 7h 9m 44s. [2026-03-26 02:20:33,793][__main__][INFO] - Starting iteration 812. [2026-03-26 02:20:33,798][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2026-03-26 02:20:33,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:20:38,554][__main__][INFO] - Number of regex retries in iteration 812: 0 [2026-03-26 02:20:38,555][__main__][INFO] - agents played in iteration 812 are Alice, Bob [2026-03-26 02:20:39,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:20:39,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:20:39,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:20:39,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:20:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:20:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:20:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:20:41,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:20:42,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:20:43,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:20:43,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:20:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:20:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:20:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:20:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:20:47,054][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:20:47,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:20:48,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:20:49,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:20:49,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:20:50,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:20:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:20:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:20:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:20:52,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:20:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:20:54,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:20:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:20:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:20:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:20:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:20:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:20:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:20:58,885][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:20:59,544][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:21:00,202][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:21:00,860][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:21:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:21:02,176][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:21:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:21:03,491][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:21:04,149][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:21:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:21:05,466][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256